Site Reliability Engineer III

SE III / SRE III

As a Software Engineer III on the Site Reliability Engineering (SRE) team, you will contribute to the design, automation, and operation of large-scale, cloud-based systems that power EA’s global gaming platform. You will work closely with senior engineers to enhance service reliability, scalability, and performance across multiple game studios and services.

Responsibilities:

Build and Operate Scalable Systems: Support the development, deployment, and maintenance of distributed, cloud-based infrastructure using modern cloud platforms and open-source technologies (AWS/GCP/Azure, Kubernetes, Terraform, Docker, etc.).
Platform Operations and Automation: Develop automation scripts, tools, and workflows to reduce manual effort, improve system reliability, and optimize infrastructure operations (reducing MTTD and MTTR).
Monitoring, Alerting & Incident Response: Create and maintain dashboards, alerts, and metrics to improve system visibility and proactively identify issues. Participate in on-call rotations and assist in incident response and root cause analysis.
Continuous Integration / Continuous Deployment (CI/CD): Contribute to the design, implementation, and maintenance of CI/CD pipelines to ensure consistent, repeatable, and reliable deployments.
Reliability and Performance Engineering: Collaborate with cross-functional teams to identify reliability bottlenecks, define SLIs/SLOs/SLAs, and implement improvements that enhance the stability and performance of production services.
Post-Incident Reviews & Documentation: Participate in root cause analyses, document learnings, and contribute to preventive measures to avoid recurrence of production issues. Maintain detailed operational documentation and runbooks.
Collaboration & Mentorship: Work closely with senior SREs and software engineers to gain exposure to large-scale systems, adopt best practices, and gradually take ownership of more complex systems and initiatives.
Modernization & Continuous Improvement: Contribute to ongoing modernization efforts by identifying areas for improvement in automation, monitoring, and reliability.

Qualifications – Software Engineer III (Site Reliability Engineer)

7+ years of experience in cloud computing (AWS preferred), virtualization, and containerization using Kubernetes, Docker, or VMware, including extensive hands-on experience with container orchestration technologies such as EKS, Kubernetes, and Docker.
Experience supporting production-grade, high-availability systems with defined SLIs/SLOs.
Strong Linux/Unix administration and networking fundamentals (protocols, load balancing, DNS, firewalls).
Hands-on experience with Infrastructure as Code and automation tools such as Terraform, Helm, Ansible, or Chef.
Proficiency in Python, Golang, Bash, or Java for scripting and automation.
Familiarity with monitoring and observability tools such as Prometheus, Grafana, Loki, or Datadog.
Exposure to distributed systems, SQL/NoSQL databases, and CI/CD pipelines.
Strong problem-solving, troubleshooting, and collaboration skills in cross-functional environments.