Site Reliability Engineer - Core

You will develop a deep understanding of the product infrastructure and design solutions that improve reliability, latency, bandwidth and security. You will build observability, monitoring and alerting, and create tooling to replace repetitive manual work. You will coordinate and centralize work across teams, educate developer teams on secure, scalable delivery, and use cloud, container and infrastructure-as-code tools to implement and operate reliable production systems.

Responsibilities

  • Evolve infrastructure to improve reliability, latency, bandwidth and security
  • Improve observability, monitoring and alerting across the platform
  • Coordinate work across teams to ensure efficient execution
  • Centralize duplicated streams of work
  • Write tooling to replace repetitive manual work
  • Operate in a fast-paced, dynamic environment

Requirements

  • Experience with containerization and service orchestration
  • Familiarity with HashiCorp Nomad, Consul and Vault
  • Strong knowledge of at least one programming language (e.g. Go, Python or Bash)
  • Linux systems administration and internals knowledge
  • Experience with cloud platforms (GCP or AWS)
  • Experience with monitoring tools such as Prometheus, Datadog, Grafana, Telegraf
  • Experience with infrastructure as code and complex Terraform deployments
  • Background with configuration management tools (SaltStack)
  • Experience using GitOps and CI, preferably GitHub Actions
  • Experience with messaging systems such as Kafka
  • Experience with database management
  • Experience working in data centers
  • Knowledge of routing and switching protocols

Benefits

  • Meaningful equity
  • Hybrid work model combining an office in the heart of London with remote work
  • Unlimited vacation policy
  • Work-from-anywhere policy for up to 20 days per year
  • Apple equipment
