Site Reliability Team Lead
We are seeking a Site Reliability Team Lead to help drive the transition from a monolithic architecture to a modular, service-oriented model. In this role, you will build out capacity for production monitoring, help the organization define SLOs/SLIs and ensure reporting of key platform metrics. Working inside the Engineering Enablement team (Platform Engineering), you will also establish best practices for production release.

Responsibilities
- Lead the design and implementation of end-to-end monitoring and alerting for Azure-based production environments
- Collaborate with development teams to improve deployment automation, infrastructure as code and CI/CD pipelines
- Manage incident response, root cause analysis and post-mortem processes
- Oversee migration mechanics, including cutover from manually built infrastructure to IaC without a production outage
- Drive reliability best practices and advocate for SRE principles across the platform engineering organization
- Define SLOs/SLIs and ensure reporting of key platform metrics
- Establish best practices for production release

Requirements
- 5+ years of experience in DevOps or Site Reliability Engineering
- Proven experience with Azure cloud services, monitoring tools and infrastructure automation
- Hands-on expertise in Azure, Terraform and Azure Pipelines
- Proficiency in App Services, Docker and Kubernetes (K8s)
- Skills in Azure DevOps, Grafana and Application Insights
- Knowledge of networking, Key Vault and Cosmos DB
- Familiarity with Identity and Azure Monitor
- Strong background in DevOps practices, CI/CD and infrastructure as code (for example, ARM templates)
- Excellent troubleshooting, communication and collaboration skills
- English proficiency at B2 level or higher