Site Reliability Engineer, Observability

At Ripple, we’re building a world where value moves like information does today. It’s big, it’s bold, and we’re already doing it. Through our crypto solutions for financial institutions, businesses, governments and developers, we are improving the global financial system and creating greater economic fairness and opportunity for more people, in more places around the world. And we get to do the best work of our careers and grow our skills surrounded by colleagues who have our backs.

If you’re ready to see your impact and unlock incredible career growth opportunities, join us and build real-world value.

Ripple Treasury, acquired by Ripple in 2025 and now a Ripple solution, marks a significant expansion into the multi-trillion-dollar corporate finance arena.

Ripple Treasury has more than 40 years of experience supporting some of the world’s largest and most sophisticated companies. Integrating its treasury command center into Ripple’s technology stack gives corporates the ability to move, manage and optimize liquidity in real time, across traditional and digital assets, under one expanded umbrella.

Join us to build the future of corporate treasury and the infrastructure that powers the Internet of Value.

THE WORK:

As a Site Reliability Engineer, you will be a force multiplier, elevating engineering capabilities across observability and incident management. You will empower Ripple's stream-aligned engineering teams to detect, diagnose, and resolve production issues quickly and effectively, helping keep our products highly available, performant, and resilient at scale for customers managing trillions in annual payment volume. You will be part of Ripple's Technical Operations team, coaching teams to build comprehensive monitoring, effective alerting, and mature incident response practices. Through workshops, consultation, and hands-on guidance, you'll help teams achieve operational excellence and self-sufficiency. If you're passionate about building capabilities in others and creating lasting impact through observability and incident management, this is the opportunity for you.


WHAT YOU’LL DO:

  • Observability Enablement
    • Coach teams on instrumenting applications with structured logs, metrics, and distributed traces using New Relic and OpenTelemetry
    • Guide teams in creating effective dashboards, alerts, and SLOs/SLIs that provide actionable insights into system health and reduce Mean Time to Detection (MTTD)
    • Teach teams to define and track error budgets, using them to balance feature velocity with reliability
    • Provide hands-on guidance during production incidents to coach real-time troubleshooting using observability data
    • Develop golden path examples for instrumentation patterns, dashboard templates, and alert configurations that teams can adopt independently
    • Help teams optimize their use of New Relic (APM, Infrastructure, Logs, Synthetics) across Azure and AWS multi-cloud environments
    • Build team capability to identify and resolve performance bottlenecks, resource constraints, and degradation patterns
  • Incident Management Administration & Enablement
    • Administer and configure the Incident.IO platform, ensuring it supports effective incident response workflows across all engineering teams
    • Coach teams on incident response best practices: classification, escalation, communication, coordination, and resolution
    • Help teams establish on-call rotation schedules, runbooks, and escalation policies that ensure appropriate incident coverage
    • Facilitate post-incident review (PIR) processes, teaching teams to identify root causes, document learnings, and implement preventive measures
    • Guide teams in defining incident severity levels and response procedures aligned with business impact
    • Integrate observability tooling (New Relic) with incident management (Incident.IO) to enable rapid detection and diagnosis
    • Track and report on incident metrics (MTTR, MTTD, incident frequency) and help teams drive continuous improvement
    • Facilitate incident management simulations (game days, failure injection exercises) to build team readiness
  • Cross-Functional Impact
    • Enable 4-6 teams per quarter to successfully adopt improved observability or incident management practices through workshops, consultation, and hands-on guidance
    • Identify and remove operational bottlenecks in monitoring and incident response, helping teams reduce MTTR and improve reliability
    • Collaborate with the Subsystems Platform Team to translate common needs into self-service observability and incident management capabilities
    • Facilitate knowledge sharing through documentation, training materials, and communities of practice that build lasting team competence
    • Measure and track team progress on observability maturity and incident management effectiveness, demonstrating measurable improvement
    • Work across Azure (80%) and AWS (20%) environments, supporting teams operating on both Windows (80%) and Linux (20%) infrastructure

WHAT YOU'LL BRING:

  • Core SRE Experience
    • 5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering with strong focus on observability and production operations
    • Proven ability to coach and mentor engineering teams with excellent communication and teaching skills across technical and non-technical audiences
    • Consultative mindset with the ability to influence and guide teams without direct authority
    • Experience working in Agile/Scrum environments and collaborating with cross-functional teams
  • Observability Expertise (Required)
    • Expert-level hands-on experience with New Relic (APM, Infrastructure Monitoring, Logs, Synthetics, Alerts) and strong proficiency writing NRQL queries for troubleshooting
    • Proven experience implementing instrumentation in application code (OpenTelemetry, Serilog, or similar frameworks)
    • Deep understanding of structured logging, metrics collection (RED/USE methods), distributed tracing, and creating effective dashboards and alerts
    • Expertise in defining and implementing SLOs/SLIs and error budgets for reliability management
    • Demonstrated ability to troubleshoot complex production issues using observability data across distributed systems
  • Incident Management Expertise (Required)
    • Hands-on experience with incident management platforms (Incident.IO, PagerDuty, Opsgenie, or similar)
    • Proven track record managing and facilitating production incidents from detection through resolution
    • Experience designing and implementing incident response processes, escalation policies, and on-call rotations
    • Strong facilitation skills conducting post-incident reviews (PIRs/postmortems) that drive actionable improvements
    • Understanding of incident severity classification, SLA/SLO breach procedures, and customer impact assessment
  • Infrastructure & Tools Experience (Required)
    • Strong experience with Azure cloud platform (App Services, Virtual Machines, Azure SQL, networking, monitoring) and working knowledge of AWS services
    • Experience with both Windows and Linux server environments
    • Familiarity with Infrastructure as Code (Terraform) for provisioning monitoring resources
    • Experience with Azure DevOps, Octopus Deploy, and GitHub in the context of deployment visibility and change tracking
    • Understanding of how deployment practices impact observability and incident response
  • Additional Valued Experience
    • Experience measuring and improving key reliability metrics (MTTR, MTTD, availability, error budgets) across engineering organizations
    • Experience building and scaling on-call practices across multiple teams
    • Background facilitating chaos engineering or game day exercises to build team resilience
    • Experience with Jira for incident tracking and workflow automation
    • Knowledge of VM-hosted SQL Server monitoring and performance optimization
    • Familiarity with FinTech compliance requirements (SOC 2, ISO 27001) and audit evidence collection
    • Experience building communities of practice around observability and incident management
    • Industry certifications such as New Relic Programmability Certification, AWS/Azure certifications, or SRE/DevOps certifications
    • Experience with scripting languages (PowerShell, Python, Bash) for automation and observability instrumentation

Other common names for this role: Senior Site Reliability Engineer, Observability Engineer, Incident Management Engineer

For positions that will be based in NY, the annual salary range for this position is below. Actual salaries may vary based on numerous factors including, among other things, an individual applicant’s experience and qualifications for the position. This range does not include equity or additional compensation, such as bonuses or commissions.
NY Annual Base Salary Range
$160,000 - $200,000 USD

WHO WE ARE:

Do Your Best Work

  • The opportunity to build in a fast-paced start-up environment with experienced industry leaders
  • A learning environment where you can dive deep into the latest technologies and make an impact. A professional development budget to support other modes of learning.
  • Thrive in an environment where no matter what race, ethnicity, gender, origin, or culture they identify with, every employee is a respected, valued, and empowered part of the team.
  • In-office collaboration for moments that matter is important to our culture, and we give managers and teams the flexibility to decide which 10+ days a month they come in.
  • Bi-weekly all-company meeting with business updates and an ask-me-anything-style discussion with our Leadership Team
  • We come together for moments that matter which include team offsites, team bonding activities, happy hours and more!

Take Control of Your Finances

  • Competitive salary, bonuses, and equity
  • Competitive benefits that cover physical and mental healthcare, retirement, family forming, and family support
  • Employee giving match
  • Mobile phone stipend

Take Care of Yourself

  • R&R days so you can rest and recharge
  • Generous wellness reimbursement and weekly onsite & virtual programming
  • Generous vacation policy - work with your manager to take time off when you need it
  • Industry-leading parental leave policies. Family planning benefits.
  • Catered lunches, fully-stocked kitchens with premium snacks/beverages, and plenty of fun events

Benefits listed above are for full-time employees.


Ripple is an Equal Opportunity Employer. We’re committed to building a diverse and inclusive team. We do not discriminate against qualified employees or applicants because of race, color, religion, gender identity, sex, sexual identity, pregnancy, national origin, ancestry, citizenship, age, marital status, physical disability, mental disability, medical condition, military status, or any other characteristic protected by local law or ordinance.