SRE Engineering Manager
As a hands-on SRE Manager, you’ll lead by example: actively driving operational excellence, contributing to code, and ensuring system reliability. You will be deeply involved in incident response across complex, distributed data platforms designed to support data exploration, analytics, and reporting solutions. These platforms operate at the unique intersection of high data volume and hybrid infrastructure, spanning both cloud and on-premises environments. We are looking for a collaborative and innovative leader who thrives under tight deadlines, excels at solving complex problems, and consistently delivers high-quality, forward-thinking solutions.
Minimum Qualifications
Bachelor’s degree or equivalent, with 10+ years of experience in the SRE domain and at least 3 years in a management role focused on leading, hiring, developing and building teams
Hands-on experience building, supporting, and maintaining applications and large-scale distributed systems in cloud or hybrid environments
Strong knowledge of cloud infrastructure and services (e.g., AWS, GCP, Kubernetes) and observability tools (e.g., Prometheus, Grafana, CloudWatch)
Strong programming experience in one of the following languages: Python, Java, or Scala
Proven ability to lead incident response, perform root cause analysis, and drive system reliability improvements
Able to lead across organizational boundaries and diverse reporting structures
Preferred Qualifications
Hands-on experience supporting enterprise data systems on distributed architectures
Expertise in cloud-native services, including ETL frameworks (Apache Spark, Flink) and messaging systems (Kafka)
Exposure to data visualization tools such as Tableau, Business Objects, and ThoughtSpot, with experience supporting and troubleshooting dashboards and reports
Experience with modern distributed databases such as Snowflake, Cassandra, SingleStore, and SAP HANA
Experience using GenAI or automation tools for issue detection, alerting, or remediation
Solid understanding of system design, data structures, and incident management best practices