AI/LLM SRE Lead Software Engineer

Job description
We have an opportunity to impact your career and provide an adventure where you can push the limits of what's possible.

As a Lead Software Engineer at JPMorgan Chase within the Employee Platforms team, you will be responsible for ensuring the reliability, scalability, and automation of AI-powered applications and infrastructure. You will partner with engineering and other stakeholders to deliver modern observability, intelligent incident response, and autonomic operations across our applications.

Job responsibilities

  • Ensure reliability, scalability, and performance of AI-assisted application and platform operations.
  • Design and implement AI-driven solutions for intelligent alerting, noise reduction, and auto-correlation systems.
  • Build and maintain observability, monitoring, and telemetry for AI applications and platforms.
  • Build and support automation for alerting, anomaly detection, and self-healing workflows.
  • Collaborate with engineering and other stakeholders to drive operational excellence.
  • Mentor and guide engineers on AIOps standards and operational excellence.
  • Define and execute the roadmap for AI-assisted SRE and observability.
  • Drive team adoption of enterprise-authorized AI-assisted engineering practices within the work environment to improve code quality, delivery speed, and operational outcomes (e.g., AI-assisted code review/refactoring, test strategy acceleration, incident/root-cause analysis support), while establishing consistent validation standards (secure coding, peer review, automated testing) and promoting reuse of effective patterns across the team.
  • Apply knowledge of tools within the Software Development Life Cycle toolchain, including enterprise-authorized AI-assisted development and automation capabilities, to improve the value realized by automation.

Required qualifications, capabilities, and skills:

  • Formal training or certification on software engineering concepts and 5+ years applied experience.
  • Strong experience in SRE, DevOps, or Platform Engineering roles.
  • Strong hands-on experience with AWS (ECS, Lambda, API Gateway, Bedrock, CloudWatch, RDS, EKS).
  • Hands-on experience with AWS Bedrock, OpenAI, or LLM APIs.
  • Expertise in observability tools: OpenTelemetry, Grafana, Prometheus, ELK, CloudWatch.
  • Experience with CI/CD tools (GitHub Actions, Jenkins, Spinnaker).
  • Proven track record in automation, operational tooling, and event-driven workflows.
  • In-depth understanding of distributed systems, microservices, and cloud architectures.
  • Demonstrated experience leading effective use of approved AI-assisted software development tools (e.g., for coding, code review, test acceleration, troubleshooting), with the ability to set team expectations for validating AI outputs for correctness, performance, and security.
  • Strong understanding of responsible AI use in engineering workflows, including data sensitivity considerations, secure handling of inputs/outputs, and adherence to resiliency and security expectations; experience coaching engineers on safe, compliant adoption within delivery practices.

Preferred qualifications, capabilities, and skills:

  • Experience with AI-powered coding assistants such as GitHub Copilot or Windsurf.
  • Familiarity with prompt engineering, embeddings, and RAG pipelines.
  • Experience building operational copilots or chatbots for runbooks or troubleshooting.
  • Proficiency in Python (Go is a plus).
