Lead Site Reliability Engineer

As a Lead Site Reliability Engineer at JPMorgan Chase within the Enterprise Technology, Liquidity Risk team, you are the owner and champion of non-functional requirements for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.

Job responsibilities

Lead SRE practices that balance delivery speed, efficiency, and system stability
Partner with engineering peers and senior stakeholders to drive strong, shared outcomes
Scale SRE adoption across application and platform teams
Set reliability expectations and show progress through stability and reliability metrics
Run blameless, data-driven post-incident reviews and regular debriefs to turn lessons into improvements
Build a continuous-improvement culture by gathering feedback and improving the customer experience
Coach entry- to mid-level engineers and promote knowledge sharing through internal forums and communities
Use enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements
Lead reuse-first adoption of AI-assisted reliability workflows across SDLC and toolchain practices (e.g., CI/CD quality checks, test and validation automation, and operational readiness), ensuring traceability, auditability, resiliency, and security controls

Required qualifications, capabilities, and skills

Formal training or certification in software engineering concepts plus 5+ years of applied experience
Advanced knowledge of SRE principles and a track record of implementing SRE across application and platform teams while avoiding common pitfalls
Experience leading technologists to manage and resolve complex technology issues at a firmwide level
Ability to influence team culture by championing innovation and driving change
Experience hiring, developing, and recognizing talent
Proficiency in at least one programming language (preferred: JavaScript, Go, Python)
Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab, Terraform)
Experience with containers and orchestration (e.g., Docker, Kubernetes, ECS)
Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity
Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align with resiliency and security expectations

Preferred qualifications, capabilities, and skills

Ability to code, troubleshoot, and demonstrate strong data fluency
Strong troubleshooting skills across common networking technologies and issues
Working knowledge of modern service and integration patterns, including GraphQL fundamentals, event-driven architecture (Kafka or equivalent), and observability/telemetry with OpenTelemetry
