Senior Lead Site Reliability Engineer
Elevate your engineering prowess to unprecedented levels by joining a team of exceptionally gifted professionals and position yourself among the top echelon in site reliability.
As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Corporate Technology, Compliance Technology team, you work with your fellow stakeholders to define non-functional requirements (NFRs) and availability targets for the services in your application and product lines. You will ensure those NFRs are accounted for in your products’ design and test phases, that your service level indicators are effectively measuring customer experience, and that service level objectives are defined with stakeholders and implemented in production.
Job Responsibilities
- Creates and delivers high-quality designs, roadmaps, and program charters, while designing and developing robust software solutions, CI/CD pipelines, and infrastructure automation to optimize system reliability, scalability, and performance
- Acts as a key resource and mentor for technologists, fostering a culture of site reliability, inclusion, and engineering excellence while guiding teams on best practices across cloud infrastructure, automation, and operational readiness
- Collaborates with stakeholders to design and implement observability, alerting, and reliability solutions, including SLOs/SLIs, monitoring frameworks, and incident response processes that ensure stable, scalable, and high-performing systems
- Uses enterprise-authorized AI capabilities within the work environment to accelerate reliability design and operational decisioning (e.g., incident/post-incident analysis and requirements traceability), validating outputs and handling operational data according to sensitivity and security requirements, while also leveraging modern tooling to optimize CI/CD and operational workflows
- Drives evolution, debugging, and performance optimization of critical systems by managing cloud-native infrastructure (AWS), container platforms (Docker/Kubernetes/EKS/ECS), and understanding application dependencies and system limitations
- Provides ongoing guidance, tools, and automated solutions including infrastructure as code (Terraform/CloudFormation/CDK), environment standardization, configuration management, patching, backups, and cost optimization strategies
- Makes significant contributions to JPMorganChase’s SRE community while supporting release management, change control, on-call rotations, and continuous improvement through post-incident reviews and operational excellence practices
- Leads reuse-first adoption of AI-assisted reliability workflows across SDLC/toolchain practices (e.g., testing/validation automation and production readiness), ensuring traceability/auditability, resiliency, and security controls, while enforcing governance, security best practices (IAM, secrets management), and reliability-focused automation
Required qualifications, capabilities, and skills
- Formal training or certification on site reliability engineering concepts and 5+ years applied experience
- Brings an advanced understanding of site reliability culture and principles and a track record of demonstrating how to implement site reliability within an application or platform
- Advanced knowledge and experience in observability such as white and black box monitoring, service level objectives, alerting, and telemetry collection, along with hands-on experience with monitoring tools such as Grafana, Dynatrace, Prometheus, Datadog, or Splunk
- Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve reliability engineering workflows with strong validation habits and awareness of data sensitivity, along with experience leveraging automation and modern DevOps practices
- Ability to set team practices for safe AI usage in operations (e.g., review/approval expectations and escalation paths) while maintaining resiliency, security, and auditability outcomes, including governance of secure cloud and automation practices
- Advanced knowledge of software applications and technical processes with considerable depth in one or more technical disciplines, including AWS services (IAM, VPC, EC2, S3, RDS/Aurora, CloudWatch, EKS/ECS, Lambda, Route 53) and CI/CD tooling (GitHub Actions, Jenkins, GitLab CI, Azure DevOps)
- Demonstrated ability to communicate data-based solutions with complex reporting and visualization methods
- Recognized as an active contributor to the engineering community
- Strong communication skills and a desire to mentor and educate others on site reliability engineering principles and practices
Preferred qualifications, skills, and capabilities
- Familiarity with modern front-end technologies
- Experience with large-scale distributed systems
- Knowledge of networking and cloud security best practices
- Strong collaboration, communication, and stakeholder management skills
- Proactive, innovative mindset with a passion for continuous learning