Senior Site Reliability Engineer, Apple Data Platform Infra SRE
Apple Service Engineering (ASE) teams build and scale the platforms and infrastructure behind many of Apple’s services, including iCloud, iTunes, Siri, and Maps. We are the foundation on which Apple’s software developers build the products that our customers love. We are looking for a passionate and dedicated Technical Lead to drive SRE standards and engineering excellence across the entire Apple Data Platform organization.
The Apple Data Platform (ADP) SRE Technical Lead partners with multiple SRE and engineering teams across the data platform, including teams responsible for Hadoop and HBase infrastructure, Spark, S3-compatible storage, and Airflow-orchestrated pipelines. Rather than owning a single vertical, this role sets the technical direction for how reliability is practiced across ADP: defining SLOs, establishing architectural review processes, developing shared tooling and automation, and ensuring that SRE principles are applied consistently as the platform scales. You will be a force multiplier, making every team around you more effective.
Minimum Qualifications
BS/MS in Computer Science or equivalent
12+ years of experience in Site Reliability Engineering, managing infrastructure and services at scale
5+ years of experience in technical leadership roles, with demonstrated ability to lead horizontally across teams without direct authority
Broad expertise across the data platform stack: Hadoop (HDFS, YARN), HBase, Apache Spark, Data Lake architectures, S3-compatible storage solutions, and Apache Airflow
History of defining and driving SLO/error budget frameworks and reliability practices across multiple teams or services
Demonstrable programming skills to develop shared tooling, lead code reviews, and set engineering standards
Strong written and verbal communication skills — able to present technical strategy to both engineers and leadership
Advanced knowledge of Linux, networking, and distributed systems fundamentals
Preferred Qualifications
15+ years of experience in SRE or related work managing infrastructure at scale
Experience with Ceph object storage operations
Kubernetes cluster operations experience, particularly running stateful data workloads
Experience with scale testing, disaster recovery, and capacity planning across distributed data systems
Experience driving multi-year platform migrations or large-scale architectural transitions
Ability to define the technical roadmap for a data platform organization and drive cross-functional alignment on architectural standards and best practices
Background in data security, access control, or compliance-sensitive data environments