Site Reliability Engineer, AI & Agentic Systems

Overview

As our SRE charter continues to evolve, this role demands strong hands-on ownership of production reliability and troubleshooting, coupled with advanced capabilities in AI- and agentic-driven automation and performance engineering.

The Site Reliability Engineer will play a critical role in ensuring the reliability, scalability, performance, and operational excellence of our platforms. The ideal candidate will leverage Azure-native AI services and agentic systems to reduce toil, improve incident response, and enable intelligent operations, while also driving performance testing practices to validate system resilience under load.

This is a hybrid role, located at our Plano, TX office. Candidates must be willing and able to work in-office 3 days per week in Plano, TX.

Applicants must be currently authorized to work in the United States on a full-time basis and must not require sponsorship for employment visa status now or in the future.

A Day in the Life

In this role, you will…

Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
Integrate AI agents with the Azure monitoring stack, CI/CD tooling, and incident management platforms
Contribute to evolving SRE standards, tooling, operational processes, and knowledge base

Responsibilities

Reliability Engineering & Production Ownership

Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
Define, measure, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets aligned with business outcomes
Drive proactive reliability improvements based on operational insights, failure mode analysis, and capacity planning
Participate in on-call rotations and take real-time ownership during production incidents

Platform & Automation Engineering

Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
Implement self-healing mechanisms, automated remediation workflows, and runbook automation
Manage and optimize API lifecycle and traffic management using Gravitee API Gateway
Design and implement durable, fault-tolerant workflows and microservice orchestration patterns using Temporal
Administer and tune PostgreSQL databases for reliability, performance, and high availability
Partner with application and platform teams to improve service operability, deployment safety, and change management

Performance Testing & Load Engineering

Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
Build and maintain performance test scripts and virtual user scenarios using Micro Focus LoadRunner and VuGen (Virtual User Generator)
Analyze performance test results to identify bottlenecks, regressions, and scalability limits; produce clear reports with actionable recommendations
Integrate performance testing into CI/CD pipelines to enable continuous performance validation and shift-left testing practices
Establish and monitor performance baselines, benchmarks, and SLAs across critical service endpoints and user journeys
Collaborate with development and architecture teams to resolve performance issues and optimize system throughput, latency, and resource utilization

AI / Agentic Engineering (Azure Focus)

Design and implement AI-driven and agentic systems to enhance operational workflows and intelligent decision-making
Build intelligent automation for operational use cases, including:
- Incident triage, enrichment, and automated escalation
- Alert correlation, deduplication, and noise reduction
- Automated diagnosis and remediation of recurring failures
Leverage Azure AI services (Azure OpenAI, Cognitive Services, Azure ML) for operational intelligence and predictive insights
Integrate AI agents with the Azure monitoring stack, CI/CD tooling, and incident management platforms
Ensure safe, reliable, and observable operation of AI-powered systems in production, including guardrails, fallback mechanisms, and audit trails

Collaboration & Technical Leadership

Act as a reliability, performance, and automation champion across engineering teams
Mentor junior SREs and influence adoption of best practices in reliability, observability, and performance engineering
Contribute to evolving SRE standards, tooling, operational processes, and knowledge base
Participate in architecture reviews and provide guidance on non-functional requirements (reliability, scalability, performance)

Qualifications

Core SRE Skills

5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
Strong hands-on experience in production troubleshooting of distributed systems at scale
Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
Proficiency in one or more programming languages: Python, Go, Java, or equivalent
Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)

Observability & Monitoring

Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
Experience with alerting strategies, SLI/SLO-based monitoring, and on-call incident management

Performance Testing & Load Engineering

Proven experience designing and executing performance and load testing for large-scale distributed applications
Hands-on proficiency with Micro Focus LoadRunner and VuGen for scripting virtual user scenarios, parameterization, correlation, and result analysis
Strong understanding of performance testing methodologies: load testing, stress testing, endurance/soak testing, spike testing, and capacity planning
Ability to analyze performance metrics (throughput, response time, error rate, resource utilization) and translate findings into engineering actions
Experience integrating performance tests into automated CI/CD pipelines

Platform & Middleware

Experience with Gravitee or equivalent API gateway platforms for traffic management, rate limiting, and API lifecycle governance
Hands-on experience with Temporal for workflow orchestration, durable execution, and distributed task management
Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning

AI / Agentic Systems

Hands-on experience building or integrating AI-powered automation in production environments
Experience with agent-based systems, LLM-powered workflows, Retrieval-Augmented Generation (RAG), or intelligent assistants
Familiarity with Azure-based AI and ML services (Azure OpenAI, Cognitive Services, Azure ML)
Understanding of reliability, safety, observability, and operational challenges of AI systems in production
