Site Reliability Engineer — Human Engineering
The Human Engineering Software team builds tools used across Apple for user studies, research participant management, health data collection, and privacy-preserving analytics. Our infrastructure spans Django backends, Kubernetes clusters (self-hosted and AWS), PostgreSQL, Redis, Kafka, Elasticsearch, and a growing set of internal service integrations.
This is an engineering-forward SRE role: you'll spend as much time designing systems as operating them. You'll work closely with our full-stack engineers to improve how services communicate, how we observe production behavior, and how we ship changes safely. You'll have a seat at the architecture table, and we want you proposing solutions, not just implementing them.
Minimum Qualifications
BS in Computer Science, Engineering, or equivalent practical experience, with 3+ years of experience in distributed systems
Deep experience with Kubernetes in production — cluster operations, networking, storage, troubleshooting
Strong proficiency designing and operating services in AWS (EC2, EKS, RDS, S3, IAM, VPC)
Hands-on infrastructure-as-code experience (Terraform, Helm, or equivalent)
Proficiency in at least one backend language (Python, Go, or similar) — you can write production services, not just scripts
Experience with CI/CD pipeline design and GitOps workflows
Strong understanding of networking fundamentals: DNS, load balancing, TLS, firewall rules, service discovery
Excellent communication skills. You can explain a complex system to a room of engineers who didn't build it
Experience building internal automation or self-service tooling (Slack bots, CLI tools, workflow orchestration) that reduced manual operational work
Preferred Qualifications
BS in Computer Science, Engineering, or equivalent practical experience, with 5+ years of experience in distributed systems
Experience with event-driven architectures (Kafka, RabbitMQ, or similar messaging systems)
Experience with service mesh or API gateway patterns (Istio, Envoy, Kong, or similar)
Familiarity with Django/Python web applications and their operational characteristics (Celery, Gunicorn, PostgreSQL)
Experience with observability tooling beyond basic monitoring: distributed tracing, SLO frameworks, structured logging
Background working with sensitive data (health data, PII) and associated compliance requirements
Experience leading incident response and building on-call culture
Contributions to internal or open-source infrastructure tooling