Site Reliability Engineer

You will lead technical initiatives to improve system reliability, performance, and scalability. You will design and implement resilient distributed systems, build automation frameworks, and elevate observability by implementing monitoring and logging solutions. You will lead incident response, perform deep troubleshooting, run blameless post-mortems, and mentor other engineers.

Responsibilities

Lead technical initiatives to improve reliability, performance, and scalability
Design and implement resilient and scalable distributed systems
Lead complex troubleshooting and root cause analysis
Develop and promote automation frameworks and tooling
Design and implement monitoring and logging for observability
Provide technical leadership during incident response
Conduct blameless post-mortem analyses and drive improvements
Mentor and guide other SREs and engineers

Requirements

Minimum of 5 years of experience in SRE, platform engineering, or software development with an operational focus
Proven technical leadership, guidance, or mentorship experience
Expert practical knowledge of Google Cloud Platform (GCP)
Deep hands-on experience with Kubernetes
Experience with infrastructure-as-code tools such as Terraform
Experience with Helm and ArgoCD
Proficiency in Python, Go, and Bash
Experience with monitoring and observability tools such as Prometheus, Grafana, and the ELK stack
Strong analytical, problem-solving, and debugging skills
Excellent communication and collaboration skills