Site Reliability Engineer, AI & Agentic Systems

Overview

As our SRE charter continues to evolve, this role demands strong hands-on ownership of production reliability and troubleshooting, coupled with advanced capabilities in AI- and agentic-driven automation and performance engineering.

The Site Reliability Engineer will play a critical role in ensuring reliability, scalability, performance, and operational excellence of our platforms. The ideal candidate will leverage Azure-native AI services and agentic systems to reduce toil, improve incident response, and enable intelligent operations—while also driving performance testing practices to validate system resilience under load.

**This is a hybrid role, located at our Plano, TX office. Candidates must be willing and able to work in-office 3 days per week in Plano, TX.

Applicants must be currently authorized to work in the United States on a full-time basis and must not require sponsorship for employment visa status now or in the future

A DAY IN THE LIFE

In this role, you will…

  • Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
  • Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
  • Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
  • Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
  • Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
  • Integrate AI agents with Azure monitoring stack, CI/CD tooling, and incident management platforms
  • Contribute to evolving SRE standards, tooling, operational processes, and knowledge base

Responsibilities

Reliability Engineering & Production Ownership

  • Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
  • Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
  • Define, measure, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets aligned with business outcomes
  • Drive proactive reliability improvements based on operational insights, failure mode analysis, and capacity planning
  • Participate in on-call rotations and take real-time ownership during production incidents

Platform & Automation Engineering

  • Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
  • Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
  • Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
  • Implement self-healing mechanisms, automated remediation workflows, and runbook automation
  • Manage and optimize API lifecycle and traffic management using Gravitee API Gateway
  • Design and implement durable, fault-tolerant workflows and microservice orchestration patterns using Temporal
  • Administer and tune PostgreSQL databases for reliability, performance, and high availability
  • Partner with application and platform teams to improve service operability, deployment safety, and change management

Performance Testing & Load Engineering

  • Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
  • Build and maintain performance test scripts and virtual user scenarios using Micro Focus LoadRunner and VuGen (Virtual User Generator)
  • Analyze performance test results to identify bottlenecks, regressions, and scalability limits; produce clear reports with actionable recommendations
  • Integrate performance testing into CI/CD pipelines to enable continuous performance validation and shift-left testing practices
  • Establish and monitor performance baselines, benchmarks, and SLAs across critical service endpoints and user journeys
  • Collaborate with development and architecture teams to resolve performance issues and optimize system throughput, latency, and resource utilization

AI / Agentic Engineering (Azure Focus)

  • Design and implement AI-driven and agentic systems to enhance operational workflows and intelligent decision-making
  • Build intelligent automation for operational use cases, including:
    • Incident triage, enrichment, and automated escalation
    • Alert correlation, deduplication, and noise reduction
    • Automated diagnosis and remediation of recurring failures
  • Leverage Azure AI services (Azure OpenAI, Cognitive Services, Azure ML) for operational intelligence and predictive insights
  • Integrate AI agents with Azure monitoring stack, CI/CD tooling, and incident management platforms
  • Ensure safe, reliable, and observable operation of AI-powered systems in production, including guardrails, fallback mechanisms, and audit trails

Collaboration & Technical Leadership

  • Act as a reliability, performance, and automation champion across engineering teams
  • Mentor junior SREs and influence adoption of best practices in reliability, observability, and performance engineering
  • Contribute to evolving SRE standards, tooling, operational processes, and knowledge base
  • Participate in architecture reviews and provide guidance on non-functional requirements (reliability, scalability, performance)

Qualifications

Core SRE Skills

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
  • Strong hands-on experience in production troubleshooting of distributed systems at scale
  • Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
  • Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
  • Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
  • Proficiency in one or more programming languages: Python, Go, Java, or equivalent
  • Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)

Observability & Monitoring

  • Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
  • Experience with alerting strategies, SLI/SLO-based monitoring, and on-call incident management

Performance Testing & Load Engineering

  • Proven experience designing and executing performance and load testing for large-scale distributed applications
  • Hands-on proficiency with Micro Focus LoadRunner and VuGen for scripting virtual user scenarios, parameterization, correlation, and result analysis
  • Strong understanding of performance testing methodologies: load testing, stress testing, endurance/soak testing, spike testing, and capacity planning
  • Ability to analyze performance metrics (throughput, response time, error rate, resource utilization) and translate findings into engineering actions
  • Experience integrating performance tests into automated CI/CD pipelines

Platform & Middleware

  • Experience with Gravitee or equivalent API gateway platforms for traffic management, rate limiting, and API lifecycle governance
  • Hands-on experience with Temporal for workflow orchestration, durable execution, and distributed task management
  • Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning

AI / Agentic Systems

  • Hands-on experience building or integrating AI-powered automation in production environments
  • Experience with agent-based systems, LLM-powered workflows, Retrieval-Augmented Generation (RAG), or intelligent assistants
  • Familiarity with Azure-based AI and ML services (Azure OpenAI, Cognitive Services, Azure ML)
  • Understanding of reliability, safety, observability, and operational challenges of AI systems in production

Similar jobs