Site Reliability Engineering Manager, Apple Data Platform
Apple Service Engineering (ASE) teams build and scale the platforms and infrastructure behind many of Apple's services (such as iCloud, iTunes, Siri, and Maps). We are the foundation on which Apple's software developers build the products that our customers love. We are looking for a passionate and dedicated Site Reliability Engineering Manager to provide technical leadership and build our team to help ensure our customers have the highest quality Apple Services experience.
You will be responsible for building, scaling, and mentoring a high-performing SRE team that champions SRE and SWE best practices, release engineering, and data-driven decision-making. You will establish strong cross-functional partnerships to ensure reliability and resiliency are embedded throughout the system lifecycle, from design and development to deployment and operations. Your leadership will help ensure Apple’s Data Platform services meet demanding availability, latency, resilience, and security requirements while continuously improving operational maturity. We are looking for a leader who is deeply passionate about operating mission-critical, globally distributed systems, preventing outages, learning from failures, and driving long-term reliability improvements.
Minimum Qualifications
10+ years of experience in software engineering, systems engineering, or infrastructure engineering.
5+ years of experience in a management role focused on leading, hiring, developing, and building teams.
Ability to weigh in on architectural decisions and align engineering execution with product and business needs.
Hands-on experience with reliability engineering, SRE, or large-scale production operations.
Practical experience in Python, Golang, and/or Java.
Knowledge of the Linux operating system, containers and virtualization, and standard networking protocols and components.
Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts.
Experience with cloud computing technologies (particularly Kubernetes).
Ability to lead cross-functional collaboration and influence technical decisions across teams.
Excellent written and verbal communication skills.
Preferred Qualifications
Experience in defining and operating SLO-based reliability and resiliency programs.
Strong knowledge of observability systems (metrics, logging, tracing) and qualification engineering.
Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open source data analytics and processing technologies, especially Apache Spark, Flink, Trino, Druid, and/or other related software.
Working experience with AI, large language models, and other efficiency or automation tools.