Site Reliability Engineer

Locations and Workstyle:

Blue Bell, PA: Primarily remote; candidates should be within commuting distance of the Blue Bell office and able to work onsite as needed. Option to come onsite more frequently if desired.
Irving, TX and Boca Raton, FL: Hybrid schedule - onsite a minimum of four days per week, with one remote day. Five days onsite may be required based on business needs.

What You'll Do:

Work closely with Infrastructure and Development teams to keep the ADT platform running and customers protected, and collaborate with cross-functional partners (IT, Security, DevOps, Engineering) to improve operational health and apply SRE best practices
Support the reliability, availability, scalability, and performance of large-scale distributed systems
Drive operational excellence through problem-solving, performance improvements, and resilient production environments
Use tools such as Terraform, Ansible, Kubernetes, and Dynatrace to support mission-critical applications
Work within cloud environments (AWS, GCP) and Kubernetes-based infrastructure, seeking guidance on complex design decisions
Identify performance bottlenecks and reliability gaps, and implement improvements
Build and maintain infrastructure as code (Terraform, Ansible) for provisioning, configuration, patching, and releases
Contribute to observability and monitoring (Dynatrace, Prometheus), including dashboards, alerts, runbooks, and tuning
Support software releases, including validation, rollback planning, and post-change verification across ADT+ and legacy platforms
Provide production support, including on-call participation, incident response, remediation follow-through, and support for customer-impacting issues during major incidents

What You'll Need:

3+ years of experience in SRE, DevOps, platform engineering, software engineering, or related roles with production and on-call responsibility
Background in systems or operations with progression toward engineering work (automation, scripting, IaC, observability)
Focus on production operations and reliability for distributed applications
Experience with infrastructure as code (Terraform, Ansible), including building and maintaining environments
Experience working in cloud environments (AWS and/or GCP)
Familiarity with Kubernetes in production environments
Proficiency in at least one programming or scripting language (Python, Java, Bash, or similar), including working with existing codebases
Understanding of software development and change management practices
Experience with monitoring and observability tools (Dynatrace, Prometheus, or similar)
Ability to diagnose and resolve production issues with sound judgment around risk, rollback, and escalation
Experience with CI/CD pipelines and automation tools
Familiarity with incident response and post-incident follow-up
Strong communication skills and ability to collaborate across teams
Comfortable learning complex systems and seeking guidance when needed
Comfortable using AI tools to accelerate investigation, automation, and documentation while maintaining sound engineering judgment

Preferred Qualifications:

Experience with Kafka, Java/JVM ecosystems, or large customer-facing platforms
Experience with security remediation at scale (patch SLAs, CVE response, OS upgrades)
Experience working with Jira-driven workflows and cross-team escalation
Familiarity with Harness, enterprise Git workflows, and audit-driven change controls