Principal Observability & Reliability Architect

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Principal Observability & Reliability Architect based in the United States.

This is a senior, client-facing architecture leadership role focused on designing and scaling enterprise observability and reliability solutions for complex digital environments. In this position, you will help organizations improve system visibility, operational resilience, incident response maturity, and end-to-end service reliability. You will operate at the intersection of strategy and hands-on architecture, guiding clients through observability transformation journeys across monitoring, telemetry pipelines, and SRE practices. The role combines deep technical expertise with consulting leadership, spanning solution design, delivery governance, and executive engagement. You will also contribute to practice development by building reusable frameworks, accelerators, and reference architectures. This is a high-impact opportunity to shape how large enterprises achieve operational excellence in modern cloud and hybrid ecosystems.

Accountabilities:

  • Lead discovery sessions, architecture workshops, and solution design activities across observability, reliability, telemetry, and operational intelligence programs for enterprise clients.
  • Design end-to-end observability architectures covering monitoring, logging, metrics, tracing, event correlation, alerting, telemetry pipelines, and platform integrations across hybrid and multi-cloud environments.
  • Define and enforce enterprise standards for telemetry governance, including naming conventions, tagging, RBAC, data quality, retention, sampling, cost optimization, and service ownership models.
  • Guide modernization initiatives such as tool consolidation, dashboard and alert rationalization, migration from legacy monitoring systems, and implementation of scalable observability platforms.
  • Establish and mature SRE practices including SLIs, SLOs, error budgets, production readiness reviews, and incident response frameworks to improve operational reliability.
  • Design integration patterns across ITSM, CMDB, event management, automation, and incident response platforms to ensure seamless operational workflows.
  • Support pre-sales and pursuit activities by shaping solution strategy, validating scope, developing estimates, and creating client-facing technical narratives.
  • Act as a senior escalation point during delivery, providing architecture governance, risk mitigation guidance, and technical oversight across engagements.
  • Develop reusable assets including reference architectures, playbooks, governance models, and accelerators while mentoring architects, consultants, and delivery teams.

Requirements:

  • 10+ years of experience in observability, platform operations, SRE, monitoring, APM, or related enterprise infrastructure domains, including 5+ years in architecture or technical leadership roles.
  • Strong hands-on expertise designing and implementing observability solutions across metrics, logs, traces, telemetry pipelines, and distributed systems in cloud and hybrid environments.
  • Deep understanding of telemetry governance frameworks, including data normalization, enrichment, routing, retention strategies, access control, and cost optimization.
  • Proven ability to define enterprise standards for dashboards, alerts, service tagging, naming conventions, RBAC, and operational maturity models.
  • Strong SRE background with practical experience implementing SLIs, SLOs, error budgets, incident response processes, and production reliability practices.
  • Experience integrating observability platforms with ITSM and operational tools such as ServiceNow, PagerDuty, Jira Service Management, or similar ecosystems.
  • Consulting or professional services experience with strong client-facing communication, workshop facilitation, estimation, and cross-functional leadership skills.
  • Ability to translate complex technical challenges into clear, actionable architecture and delivery plans for both technical and executive audiences.
  • Experience with platforms such as Datadog, Dynatrace, Splunk, Grafana, New Relic, Prometheus, or OpenTelemetry is highly desirable.
  • Familiarity with telemetry pipeline tools such as Kafka, Fluent Bit, OpenTelemetry Collector, or similar technologies is a strong plus.
  • Experience building reusable consulting assets such as reference architectures, accelerators, and governance frameworks is preferred.

Benefits:

  • Competitive On-Target Earnings (OTE) package including base salary and performance-based incentives, determined by experience and location.
  • Comprehensive medical, dental, and vision insurance.
  • 401(k) retirement savings plan.
  • Paid time off, company holidays, and parental/caregiver leave.
  • Flexible work environment supporting collaboration and autonomy.
  • Access to advanced technology environments, innovation labs, and continuous learning opportunities.
  • Certification reimbursement and professional development support.
  • Inclusive culture focused on diversity, collaboration, and employee resource communities.

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.