Senior Site Reliability Engineer, Core AI Infrastructure

You will own the reliability and automation of AI infrastructure. You will monitor services, respond to incidents, perform root cause analysis with blameless retros, and provide on-call support for AWS deployment pipelines. You will build automation to streamline operational IT workflows across CI/CD frameworks and Kubernetes environments. You will partner with the infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with security and compliance to integrate surveillance tooling into deployment pipelines. You will strengthen observability and documentation standards and develop full-stack applications in Go or Python that power internal AI products and infrastructure.

Responsibilities

Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros
Build automation and tooling to streamline operational IT workflows, eliminate manual tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments
Partner with the infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with security and compliance to integrate surveillance tooling into deployment pipelines
Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence
Develop full-stack applications in Go or Python that power internal AI products and infrastructure

Requirements

5+ years of experience automating and supporting cloud infrastructure (AWS) and network environments, with hands-on use of infrastructure-as-code tools (Terraform, Ansible, Chef, Puppet, or Salt)
Proven experience deploying, managing, and troubleshooting containerized workloads using Docker and Kubernetes in production environments
Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and version control workflows using Git-based CI/CD pipelines
Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements
Responsible use of generative AI, with human oversight, to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality

Benefits

Equity and bonus eligibility
Medical, dental, and vision insurance
401(k)
Remote-first work arrangement
