Senior Site Reliability Engineer (R-19383)

The Senior Site Reliability Engineer (SRE) is responsible for the reliability, availability, performance, and operability of production systems across our platforms. The role applies software engineering practices to operations, with a focus on automation, observability, and incident response.

Responsibilities:

  • Own and improve the reliability, availability, and performance of production services in Google Cloud Platform (GCP).
  • Participate in incident management, including detection, triage, mitigation, escalation, and recovery.
  • Use and improve incident workflows and tooling (e.g., ServiceNow) to ensure clear ownership and timely communication.
  • Design, implement, and operate observability solutions including monitoring, logging, tracing, synthetics, and dashboards (e.g., Splunk Observability, OpenTelemetry).
  • Reduce operational toil through automation and engineering-led solutions, proactively introducing and driving SRE best practices.
  • Support on-call rotations across multiple time zones, contributing to a sustainable 24/7 support model.
  • Define, monitor, and report SLIs, SLOs, and error budgets for critical services.
  • Drive and be accountable for best-in-class service availability through SRE principles, automation, and proactive reliability engineering.

Essential skills and/or certifications:

  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • Strong experience with cloud-native concepts and technologies, with a strong preference for Google Cloud Platform (GCP) and Kubernetes (GKE).
  • Proven experience with Site Reliability Engineering and production incident management, ideally using platforms such as ServiceNow.
  • Experience with monitoring and observability tools, including metrics, logs, traces, and synthetics (e.g., Splunk Observability, OpenTelemetry).
  • Exposure to reliability testing, resilience engineering, or cost optimisation initiatives.
  • Excellent analytical and problem-solving skills, with the ability to diagnose complex production issues quickly.
  • Software development or automation experience using Python, shell scripts, or similar languages.
  • Hands-on experience operating production cloud infrastructure at scale.
  • Experience managing multi-region, high-availability production systems with a focus on scalability, resilience, and minimising service disruption during failures.
  • Proficiency in the Microsoft Office suite.
  • An ownership mindset in everything you do: be a problem solver, be curious and proactive, take action, and seek ways to collaborate and connect with people and teams to drive success.
  • A continuous growth mindset: keep learning through social experiences and relationships with stakeholders, experts, colleagues, and mentors, and broaden your competencies through structured courses and programs.
  • Where applicable, fluency in English and languages relevant to the working market.