SRE - Bangalore
Zensar Technologies Limited is looking for a talented and motivated Site Reliability Engineer (SRE) to join our team in Bangalore. This is a long-term contract role with a hybrid work arrangement; the shift schedule is flexible and may change in the future. As an SRE, you will drive the reliability and performance of our collaboration services so that users have a seamless experience. Your expertise in Kubernetes and AWS will be instrumental in optimizing our cloud and hybrid environments.
Responsibilities
- Own the deployment and operation of critical collaboration services, ensuring high availability and scalability across cloud and hybrid environments.
- Design and optimize CI/CD pipelines and automation, incorporating AI-first tooling for efficient deployment, monitoring, and incident response.
- Lead incident response for complex production issues, conducting thorough root cause analysis and implementing systemic improvements.
- Use observability data to guide capacity planning, scaling strategies, and resource optimization so that services perform well under load.
- Define and promote operational best practices, ensuring high-quality documentation and fostering a culture of reliability and operational excellence.
- Collaborate with cross-functional teams to align on service requirements and ensure seamless integration with existing systems.
- Stay updated with the latest industry trends and technologies, and propose innovative solutions to enhance our SRE practices.
- Mentor and guide junior team members, sharing your expertise and fostering a culture of continuous learning and improvement.
- Participate in on-call rotations and provide timely support during critical incidents, ensuring swift resolution and minimal impact on services.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent professional experience with a minimum of 7 years in Site Reliability Engineering, Cloud Operations, or Systems Engineering.
- Strong hands-on experience operating production services with Docker and Kubernetes in cloud or hybrid environments, with a track record of successful deployments and operations.
- Proficiency in programming or scripting languages such as Python, Go, or Bash, with the ability to develop automation and operational tooling.
- Experience with monitoring, observability, and incident response in production environments, including participation in on-call duties and post-incident reviews.
- Working knowledge of Linux systems, networking, distributed systems, CI/CD pipelines, infrastructure-as-code principles, and Git-based workflows.
- Familiarity with large-scale, globally distributed SaaS platforms is preferred, as is experience with hybrid cloud environments and multi-region deployments.
- Ability to apply AI-assisted or automation-first approaches to SRE tooling and workflows, enhancing efficiency and reliability.
- Excellent written communication skills for creating clear and concise operational documentation, runbooks, and knowledge-sharing materials.
- Strong problem-solving and analytical skills, with the ability to troubleshoot complex issues and propose innovative solutions.
- A collaborative mindset and a passion for continuous learning and improvement, with a willingness to share knowledge and mentor team members.