Sr. SRE/DevOps Engineer

Role Description

Madison Reed is seeking a hands-on Senior SRE / AI Platform DevOps Engineer to build, operate, and scale the infrastructure behind our AI-powered services, agents, and orchestration platforms.

This role sits at the intersection of site reliability engineering, cloud infrastructure, DevOps automation, observability, and AI operations. You will own the systems and practices that ensure our AI-enabled services are reliable, secure, scalable, cost-effective, and production-ready.

The ideal candidate is infrastructure-first and operationally minded, with deep experience in cloud environments, CI/CD, production monitoring, incident response, and automation. You will help operationalize AI systems by building reliable deployment workflows, telemetry pipelines, monitoring frameworks, and governance processes for models, agents, and orchestration services.

This is a highly hands-on engineering role for someone who enjoys building resilient platforms, reducing operational risk, improving deployment velocity, and making advanced technology dependable in real-world production environments.

The base range for this position is $170k-$175k. At Madison Reed, we aim to pay competitively. Factors that may affect starting pay within this range include geography/market, skills, education, experience, and other qualifications of the successful candidate. This role must be based in the United States.

Key Responsibilities

Infrastructure Provisioning & Automation

  • Design, provision, and manage cloud infrastructure for AI-powered services, agents, orchestration systems, and supporting platforms.
  • Automate environment setup and configuration across development, staging, and production environments.
  • Build reusable infrastructure-as-code patterns that improve consistency, security, scalability, and maintainability.
  • Partner with engineering teams to ensure production systems are resilient, observable, performant, and cost-efficient.
  • Participate in on-call support, incident response, root cause analysis, and continuous reliability improvement.

CI/CD & Deployment Engineering

  • Build, maintain, and optimize CI/CD pipelines for services, agents, orchestration layers, and supporting infrastructure.
  • Implement automated testing, validation, security, and reliability gates within deployment workflows.
  • Design safe deployment patterns including blue/green deployments, canary releases, feature flags, and automated rollback mechanisms.
  • Integrate health checks, service readiness checks, and reliability signals into release processes.
  • Improve deployment speed and confidence while reducing production risk.

AI Platform Operations & Deployment Governance

  • Package, version, deploy, and manage AI models, agent services, and orchestration components across environments.
  • Support safe rollout, rollback, refresh, and retirement workflows for AI-powered services.
  • Monitor AI service performance across latency, throughput, availability, cost, quality, and business-critical reliability signals.
  • Implement operational controls for AI systems, including version tracking, environment promotion, access management, and change governance.
  • Partner with data, engineering, product, and support teams to ensure AI systems are production-ready and operationally accountable.

Telemetry, Observability & Data Pipelines

  • Design and operate scalable telemetry pipelines for logs, metrics, traces, model events, agent interactions, and operational signals.
  • Enable structured observability for AI services and orchestration systems to support real-time monitoring, alerting, and diagnostics.
  • Build dashboards, alerts, and reporting that provide actionable insight into system health, performance, reliability, and cost.
  • Improve incident detection, triage, and resolution through high-quality telemetry and operational data.
  • Support data-driven reliability practices, including SLOs, error budgets, service health reviews, and post-incident analysis.

AIOps Platform Integration

  • Implement intelligent monitoring, alert correlation, anomaly detection, and automated incident response capabilities.
  • Integrate AIOps tools and workflows into existing DevOps, SRE, and engineering operations.
  • Build automation that reduces manual operational work and improves mean time to detect and resolve issues.
  • Identify opportunities to use AI and automation to improve platform reliability, observability, supportability, and operational efficiency.

Production Reliability & SRE Excellence

  • Define and maintain reliability standards for AI-powered production systems.
  • Establish and track service-level indicators, service-level objectives, and operational readiness requirements.
  • Lead reliability reviews, production readiness assessments, and infrastructure risk assessments.
  • Drive improvements in system resilience, scalability, security, performance, and cost optimization.
  • Champion SRE best practices across engineering teams.

Qualifications

Required Experience

  • 5+ years of experience in DevOps, Site Reliability Engineering, Platform Engineering, Cloud Infrastructure, or related roles.
  • Strong hands-on experience with cloud infrastructure, preferably AWS.
  • Experience building and maintaining CI/CD pipelines and automated deployment workflows.
  • Proficiency with infrastructure-as-code tools such as Terraform, CloudFormation, or similar.
  • Experience operating production systems with strong monitoring, alerting, logging, and incident response practices.
  • Strong scripting or programming skills in Python, Bash, Go, or similar languages.
  • Experience designing reliable, secure, scalable, and cost-conscious infrastructure.
  • Comfortable participating in on-call rotations and supporting production systems.

Preferred Experience

  • Experience operating AI, ML, agent-based, or data-intensive systems in production.
  • Familiarity with model deployment, model versioning, inference services, or MLOps workflows.
  • Experience with observability platforms such as Datadog, New Relic, Grafana, Prometheus, OpenTelemetry, Splunk, or similar.
  • Experience with event-driven architectures, queueing systems, streaming platforms, or telemetry pipelines.
  • Familiarity with AIOps concepts such as anomaly detection, alert correlation, automated remediation, and intelligent incident response.
  • Experience implementing SLOs, error budgets, production readiness reviews, and reliability scorecards.
  • Understanding of security, compliance, access control, and governance practices for production systems.

What Success Looks Like

  • AI-powered services are deployed through reliable, repeatable, and well-governed CI/CD workflows.
  • Production systems have clear monitoring, alerting, dashboards, and ownership.
  • Incidents are detected quickly, triaged efficiently, and resolved with strong post-incident learning.
  • Infrastructure is automated, scalable, secure, and cost-effective.
  • Engineering teams can deploy AI services with confidence and lower operational risk.
  • Madison Reed’s AI platform becomes increasingly reliable, observable, and operationally mature.

Big on Benefits

The Perks? Glad you asked…

  • Comprehensive Healthcare
  • 100% Company-Paid Short- and Long-Term Disability
  • 401k Participation and Equity Grants
  • Continuing Education Contributions
  • HSA Employer Contributions and FSA Options
  • Parental Leave Program
  • Commuter Benefits
  • Responsible Paid Time Off Program
  • Complimentary Madison Reed Products + Discounts on Hair Color Bar Services
  • Company-sponsored events
  • But wait, there’s more…

We are Madison Reed.

We’re disrupting a $50 billion industry.

Since 2013, we’ve offered our clients the option to truly own their beauty with a revolutionary choice—your place or ours? Home or Hair Color Bar? Our professional hair color is truly omnichannel, with the option to order or subscribe through our website, pick up in-store at our Hair Color Bars, or make an appointment at one of our Hair Color Bar locations (over 20 & growing). At our Hair Color Bars, clients can choose from a variety of color services from licensed cosmetologists—permanent hair color, roots only, hair gloss, highlights and more. With our men’s line launched in 2020, we’re shaking up the $50 billion hair care industry with products that continue to raise the bar for doing hair at home.

We live our values.

Here at our San Francisco headquarters and in every Hair Color Bar, we truly live our values: Love, Joy, Courage, Responsibility, and Trust. Our values inform everything we do, from how we treat our clients to how we treat every member of our fast-growing team. Our founder & CEO, Amy Errett, has fostered a one-of-a-kind culture based on transparency, accountability, and fun, where diversity and inclusion are of utmost importance and every team member feels supported to succeed.

We are hair color that breaks the rules.

Our commitment to the ultimate client experience, paired with our dedication to product innovation and the latest beauty technology, has attracted a devoted, consistently growing base of fans, converts, and color evangelists. We love what we do, and it shows.

Join us in our mission to live life colorfully and make personal care more personal.

Information for Recruiters: Madison Reed only accepts resumes directly from candidates. Madison Reed does not accept unsolicited resumes from staffing vendors, including recruitment agencies and/or search firms, and does not pay fees to any such vendors for any unsolicited resumes.

Madison Reed is an equal opportunity employer. We are committed to recruiting, training, compensating and promoting our employees regardless of race, color, religion, sex, disability, national origin, age, sexual orientation, gender or any other protected classes as required by applicable law that might make us unique or different. As a company, we are dedicated to reflecting the diversity, multiculturalism, and inclusion found in the communities we serve. Inclusion is at the heart of what we do, from the way we craft our job descriptions, to the values we espouse daily.