Site Reliability Engineer III

SE III / SRE III

As a Software Engineer III on the Site Reliability Engineering (SRE) team, you will contribute to the design, automation, and operation of large-scale, cloud-based systems that power EA’s global gaming platform. You will work closely with senior engineers to enhance service reliability, scalability, and performance across multiple game studios and services.

Responsibilities:

  • Build and Operate Scalable Systems: Support the development, deployment, and maintenance of distributed, cloud-based infrastructure built on public cloud platforms and modern open-source technologies (AWS/GCP/Azure, Kubernetes, Terraform, Docker, etc.).
  • Platform Operations and Automation: Develop automation scripts, tools, and workflows to reduce manual effort, improve system reliability, and optimize infrastructure operations, with the goal of reducing mean time to detect (MTTD) and mean time to recovery (MTTR).
  • Monitoring, Alerting & Incident Response: Create and maintain dashboards, alerts, and metrics to improve system visibility and proactively identify issues. Participate in on-call rotations and assist in incident response and root cause analysis.
  • Continuous Integration / Continuous Deployment (CI/CD): Contribute to the design, implementation, and maintenance of CI/CD pipelines to ensure consistent, repeatable, and reliable deployments.
  • Reliability and Performance Engineering: Collaborate with cross-functional teams to identify reliability bottlenecks, define SLIs/SLOs/SLAs, and implement improvements that enhance the stability and performance of production services.
  • Post-Incident Reviews & Documentation: Participate in root cause analyses, document learnings, and contribute to preventive measures to avoid recurrence of production issues. Maintain detailed operational documentation and runbooks.
  • Collaboration & Mentorship: Work closely with senior SREs and software engineers to gain exposure to large-scale systems, adopt best practices, and gradually take ownership of more complex systems and initiatives.
  • Modernization & Continuous Improvement: Contribute to ongoing modernization efforts by identifying areas for improvement in automation, monitoring, and reliability.


Qualifications – Software Engineer III (Site Reliability Engineer)

  • 7+ years of experience in cloud computing (AWS preferred), virtualization, and containerization using Kubernetes, Docker, or VMware, including extensive hands-on experience with container orchestration technologies such as EKS, Kubernetes, and Docker.
  • Experience supporting production-grade, high-availability systems with defined SLIs/SLOs.
  • Strong Linux/Unix administration and networking fundamentals (protocols, load balancing, DNS, firewalls).
  • Hands-on experience with Infrastructure as Code and automation tools such as Terraform, Helm, Ansible, or Chef.
  • Proficiency in Python, Golang, Bash, or Java for scripting and automation.
  • Familiarity with monitoring and observability tools like Prometheus, Grafana, Loki, or Datadog.
  • Exposure to distributed systems, SQL/NoSQL databases, and CI/CD pipelines.
  • Strong problem-solving, troubleshooting, and collaboration skills in cross-functional environments.