Site Reliability Engineer

Locations and Workstyle:

Blue Bell, PA: Primarily remote; candidates should be within commuting distance of the Blue Bell office and able to work onsite as needed. Option to come onsite more frequently if desired.
Irving, TX and Boca Raton, FL: Hybrid schedule with a minimum of four days onsite and one remote day per week. Five days onsite may be required based on business needs.

What You'll Do:

  • Work closely with Infrastructure and Development teams to keep the ADT platform running and customers protected, while collaborating with cross-functional partners (IT, Security, DevOps, Engineering) to improve operational health and apply SRE best practices
  • Support the reliability, availability, scalability, and performance of large-scale distributed systems
  • Drive operational excellence through problem-solving, performance improvements, and resilient production environments
  • Use tools such as Terraform, Ansible, Kubernetes, and Dynatrace to support mission-critical applications
  • Work within cloud environments (AWS, GCP) and Kubernetes-based infrastructure, with guidance on complex design decisions
  • Identify performance bottlenecks and reliability gaps, and implement improvements
  • Build and maintain infrastructure as code (Terraform, Ansible) for provisioning, configuration, patching, and releases
  • Contribute to observability and monitoring (Dynatrace, Prometheus), including dashboards, alerts, runbooks, and tuning
  • Support software releases, including validation, rollback planning, and post-change verification across ADT+ and legacy platforms
  • Provide production support, including on-call participation, incident response, remediation follow-through, and support for customer-impacting issues during major incidents

What You'll Need:

  • 3+ years of experience in SRE, DevOps, platform engineering, software engineering, or related roles with production and on-call responsibility
  • Background in systems or operations with progression toward engineering work (automation, scripting, IaC, observability)
  • Focus on production operations and reliability for distributed applications
  • Experience with infrastructure as code (Terraform, Ansible), including building and maintaining environments
  • Experience working in cloud environments (AWS and/or GCP)
  • Familiarity with Kubernetes in production environments
  • Proficiency in at least one programming or scripting language (Python, Java, Bash, or similar), including working with existing codebases
  • Understanding of software development and change management practices
  • Experience with monitoring and observability tools (Dynatrace, Prometheus, or similar)
  • Ability to diagnose and resolve production issues with sound judgment around risk, rollback, and escalation
  • Experience with CI/CD pipelines and automation tools
  • Familiarity with incident response and post-incident follow-up
  • Strong communication skills and ability to collaborate across teams
  • Comfort with learning complex systems and seeking guidance when needed
  • Comfort with using AI tools to accelerate investigation, automation, and documentation while maintaining sound engineering judgment

Preferred Qualifications:

  • Experience with Kafka, Java/JVM ecosystems, or large customer-facing platforms
  • Experience with security remediation at scale (patch SLAs, CVE response, OS upgrades)
  • Experience working with Jira-driven workflows and cross-team escalation
  • Familiarity with Harness, enterprise Git workflows, and audit-driven change controls