Lead Site Reliability Engineer
As a Lead Site Reliability Engineer at JPMorgan Chase within the Enterprise Technology, Liquidity Risk team, you are the owner and champion of non-functional requirements for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.
Job responsibilities
- Lead SRE practices that balance delivery speed, efficiency, and system stability
- Partner with engineering peers and senior stakeholders to drive strong, shared outcomes
- Scale SRE adoption across application and platform teams
- Set reliability expectations and show progress through stability and reliability metrics
- Run blameless, data-driven post-incident reviews and regular debriefs to turn lessons into improvements
- Build a continuous-improvement culture by gathering feedback and improving the customer experience
- Coach entry- to mid-level engineers and promote knowledge sharing through internal forums and communities
- Use enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements
- Lead reuse-first adoption of AI-assisted reliability workflows across SDLC and toolchain practices (e.g., CI/CD quality checks, test and validation automation, and operational readiness), ensuring traceability, auditability, resiliency, and security controls
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering concepts plus 5+ years of applied experience
- Advanced knowledge of SRE principles and a track record of implementing SRE across application and platform teams while avoiding common pitfalls
- Experience leading technologists to manage and resolve complex technology issues at a firmwide level
- Ability to influence team culture by championing innovation and driving change
- Experience hiring, developing, and recognizing talent
- Proficiency in at least one programming language (preferred: JavaScript, Go, Python)
- Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab, Terraform)
- Experience with containers and orchestration (e.g., Docker, Kubernetes, ECS)
- Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity
- Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations
Preferred qualifications, capabilities, and skills
- Ability to code, troubleshoot, and demonstrate strong data fluency
- Strong troubleshooting skills across common networking technologies and issues
- Working knowledge of modern service and integration patterns, including GraphQL fundamentals, event-driven architecture (Kafka or equivalent), and observability/telemetry with OpenTelemetry