Agentic Infrastructure Observability Engineer


ABOUT XENONSTACK

XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.

We deliver innovation through:

  • Agentic Systems for AI Agents → akira.ai
  • Vision AI Platform → xenonstack.ai
  • Inference AI Infrastructure for Agentic Systems → nexastack.ai

Our mission is to accelerate the world’s transition to AI + Human Intelligence by building platforms that are scalable, reliable, and observable by design.


THE OPPORTUNITY

We are seeking an Agentic Infrastructure Observability Engineer to design and implement end-to-end observability frameworks for AI-native and multi-agent systems.

This role sits at the heart of AgentOps and Reliability Engineering, ensuring that agents, pipelines, and infrastructure are monitored, measurable, and continuously optimized.

If you thrive on metrics, monitoring, and making complex systems transparent and reliable, this role offers a chance to define observability for the next generation of enterprise AI.


KEY RESPONSIBILITIES

  • Observability Frameworks
      • Design and implement observability pipelines covering metrics, logs, traces, and cost telemetry for agentic systems.
      • Build dashboards and alerting systems to monitor reliability, performance, and drift in real time.

  • Agentic AI Monitoring
      • Track LLM usage, context windows, token allocation, and multi-agent interactions.
      • Build monitoring hooks into LangChain, LangGraph, MCP, and RAG pipelines.

  • Reliability & Performance
      • Define and monitor SLOs, SLIs, and SLAs for agentic workflows and inference infrastructure.
      • Conduct root cause analysis of agent failures, latency issues, and cost spikes.

  • Automation & Tooling
      • Integrate observability into CI/CD and AgentOps pipelines.
      • Develop custom plugins and scripts to extend observability for LLMs, agents, and data pipelines.

  • Collaboration & Reporting
      • Work with AgentOps, DevOps, and Data Engineering teams to ensure system-wide observability.
      • Provide executive-level reporting on reliability, efficiency, and adoption metrics.

  • Continuous Improvement
      • Implement feedback loops to improve agent performance and reduce downtime.
      • Stay current with state-of-the-art observability and AI monitoring frameworks.


SKILLS & QUALIFICATIONS

Must-Have

  • 3–6 years of experience in SRE, DevOps, or Observability Engineering.
  • Strong knowledge of observability tools (Prometheus, Grafana, ELK, OpenTelemetry, Jaeger).
  • Experience with cloud-native infrastructure (AWS, GCP, Azure) and Kubernetes monitoring.
  • Proficiency in Python, Go, or Bash for scripting and automation.
  • Understanding of AI/LLM pipelines, RAG systems, and vector databases.
  • Hands-on experience with CI/CD pipelines and monitoring-as-code.

Good-to-Have

  • Experience with AgentOps tools (LangSmith, PromptLayer, Arize AI, Weights & Biases).
  • Exposure to AI-specific observability (token usage, model latency, hallucination tracking).
  • Knowledge of Responsible AI monitoring frameworks.
  • Background in BFSI, GRC, SOC, or other regulated industries.


WHY SHOULD YOU JOIN US?

  1. Agentic AI Product Company
     Build observability frameworks for next-gen enterprise AI systems.

  2. A Fast-Growing Category Leader
     Be part of one of the fastest-growing AI Foundries, powering mission-critical agent deployments.

  3. Career Mobility & Growth
     Advance into roles like Reliability Architect, AgentOps Lead, or Head of Observability.

  4. Global Exposure
     Work on observability challenges across Fortune 500 enterprises and global innovators.

  5. Create Real Impact
     Ensure transparency, trust, and resilience in production-grade AI systems.

  6. Culture of Excellence
     Our values (Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession) give you autonomy to innovate and accountability to deliver.

  7. Responsible AI First
     Help enterprises adopt AI that is not just powerful, but explainable and auditable.


XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!

At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.

Our Cultural Values

  • Agency – Be self-directed and proactive.
  • Taste – Sweat the details and build with precision.
  • Ownership – Take responsibility for outcomes.
  • Mastery – Commit to continuous learning and growth.
  • Impatience – Move fast and embrace progress.
  • Customer Obsession – Always put the customer first.

Our Product Philosophy

  • Obsessed with Adoption – Making observability and trust an integral part of enterprise AI.
  • Obsessed with Simplicity – Turning complex monitoring into seamless, actionable insights.

Be part of our mission to accelerate the world’s transition to AI + Human Intelligence by making agentic AI systems transparent, observable, and reliable at scale.
