Site Reliability Engineer (SRE), London
We're looking for a hardworking and passionate SRE to join this amazing team. You will be an accomplished builder and problem-solver, eager to tackle challenging technical problems. You have a deep understanding of SRE principles and the expertise required to operate services at Apple scale with a high degree of operational excellence.
This role will allow you to directly contribute to shaping the future of how we build and run our services on a global scale. You will possess strong technical skills to dive deep into complex systems while also understanding and contributing to higher-level business and product goals. We seek high-quality engineers with a diverse set of experiences and skill sets. Our customers count on us to provide extraordinary availability, scalability, and security for services. If you’d like to positively influence millions of customers’ experience of Apple through your technical contributions, this is the job for you.
Minimum Qualifications
In-depth, hands-on experience operating managed Kubernetes (GKE and/or EKS) in a public cloud, with experience scaling distributed systems. Experience across multiple public clouds (GCP and AWS) strongly preferred.
Strong experience with deploying, supporting and supervising new and existing services, platforms and application stacks
Experience with scale testing, disaster recovery, and capacity planning
Passion for replacing repetitive manual processes with automation, and for improving that automation through iteration
Confirmed ability to write programs using a high-level programming language such as Java, Swift, Python, or TypeScript
Inclination toward writing efficient code, using complexity analysis to guide improvements.
Experience with Nginx, Envoy, Prometheus, and/or Docker.
Preferred Qualifications
Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load balancing strategies.
Understanding of the Linux operating system, including the kernel, memory, processes, threads, static and shared libraries, IPC, and signals.
Experience with Infrastructure-as-Code and config-as-code tooling such as Pulumi, Terraform, or Pkl.
Experience with fleet/cluster lifecycle management, node provisioning, and hardware-adjacent reliability (e.g., GPU health, capacity management) at scale.
Experience building and operating CI/CD pipelines for cloud infrastructure.