Site Reliability Engineer
You will lead technical initiatives to improve system reliability, performance, and scalability. You will design and implement resilient distributed systems, build automation frameworks, and elevate observability by implementing monitoring and logging solutions. You will lead incident response, perform deep troubleshooting, run blameless post-mortems, and mentor other engineers.
Responsibilities
- Lead technical initiatives to improve reliability, performance, and scalability
- Design and implement resilient and scalable distributed systems
- Lead complex troubleshooting and root cause analysis
- Develop and promote automation frameworks and tooling
- Design and implement monitoring and logging for observability
- Provide technical leadership during incident response
- Conduct blameless post-mortem analyses and drive improvements
- Mentor and guide other SREs and engineers
Requirements
- Minimum 5 years of experience in SRE, platform engineering, or software development with an operational focus
- Proven experience in technical leadership, guidance, or mentorship
- Expert practical knowledge of Google Cloud Platform (GCP)
- Deep hands-on experience with Kubernetes
- Experience with infrastructure-as-code tools such as Terraform
- Experience with Helm and ArgoCD
- Proficiency in Python, Go, and Bash
- Experience with monitoring and observability tools such as Prometheus, Grafana, and the ELK stack
- Strong analytical, problem-solving, and debugging skills
- Excellent communication and collaboration skills