Agentic Infrastructure Observability Engineer

ABOUT XENONSTACK
XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.
We deliver innovation through:
- Agentic Systems for AI Agents → akira.ai
- Vision AI Platform → xenonstack.ai
- Inference AI Infrastructure for Agentic Systems → nexastack.ai
Our mission is to accelerate the world’s transition to AI + Human Intelligence by building platforms that are scalable, reliable, and observable by design.
THE OPPORTUNITY
We are seeking an Agentic Infrastructure Observability Engineer to design and implement end-to-end observability frameworks for AI-native and multi-agent systems.
This role sits at the heart of AgentOps and Reliability Engineering, ensuring that agents, pipelines, and infrastructure are monitored, measurable, and continuously optimized.
If you thrive on metrics, monitoring, and making complex systems transparent and reliable, this role offers a chance to define observability for the next generation of enterprise AI.
KEY RESPONSIBILITIES
Observability Frameworks
- Design and implement observability pipelines covering metrics, logs, traces, and cost telemetry for agentic systems.
- Build dashboards and alerting systems to monitor reliability, performance, and drift in real time.
Agentic AI Monitoring
- Track LLM usage, context windows, token allocation, and multi-agent interactions.
- Build monitoring hooks into LangChain, LangGraph, MCP, and RAG pipelines.
Reliability & Performance
- Define and monitor SLOs, SLIs, and SLAs for agentic workflows and inference infrastructure.
- Conduct root cause analysis of agent failures, latency issues, and cost spikes.
Automation & Tooling
- Integrate observability into CI/CD and AgentOps pipelines.
- Develop custom plugins/scripts to extend observability for LLMs, agents, and data pipelines.
Collaboration & Reporting
- Work with AgentOps, DevOps, and Data Engineering teams to ensure system-wide observability.
- Provide executive-level reporting on reliability, efficiency, and adoption metrics.
Continuous Improvement
- Implement feedback loops to improve agent performance and reduce downtime.
- Stay current with state-of-the-art observability and AI monitoring frameworks.
SKILLS & QUALIFICATIONS
Must-Have
- 3–6 years of experience in SRE, DevOps, or Observability Engineering.
- Strong knowledge of observability tools (Prometheus, Grafana, ELK, OpenTelemetry, Jaeger).
- Experience with cloud-native infrastructure (AWS, GCP, Azure) and Kubernetes monitoring.
- Proficiency in Python, Go, or Bash for scripting and automation.
- Understanding of AI/LLM pipelines, RAG systems, and vector databases.
- Hands-on experience with CI/CD pipelines and monitoring-as-code.
Good-to-Have
- Experience with AgentOps tools (LangSmith, PromptLayer, Arize AI, Weights & Biases).
- Exposure to AI-specific observability (token usage, model latency, hallucination tracking).
- Knowledge of Responsible AI monitoring frameworks.
- Background in BFSI, GRC, SOC, or other regulated industries.
WHY SHOULD YOU JOIN US?
1. Agentic AI Product Company: Build observability frameworks for next-gen enterprise AI systems.
2. A Fast-Growing Category Leader: Be part of one of the fastest-growing AI Foundries, powering mission-critical agent deployments.
3. Career Mobility & Growth: Advance into roles like Reliability Architect, AgentOps Lead, or Head of Observability.
4. Global Exposure: Work on observability challenges across Fortune 500 enterprises and global innovators.
5. Create Real Impact: Ensure transparency, trust, and resilience in production-grade AI systems.
6. Culture of Excellence: Our values (Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession) give you autonomy to innovate and accountability to deliver.
7. Responsible AI First: Help enterprises adopt AI that is not just powerful, but explainable and auditable.
XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!
At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.
Our Cultural Values
- Agency – Be self-directed and proactive.
- Taste – Sweat the details and build with precision.
- Ownership – Take responsibility for outcomes.
- Mastery – Commit to continuous learning and growth.
- Impatience – Move fast and embrace progress.
- Customer Obsession – Always put the customer first.
Our Product Philosophy
- Obsessed with Adoption – Making observability and trust an integral part of enterprise AI.
- Obsessed with Simplicity – Turning complex monitoring into seamless, actionable insights.
Be part of our mission to accelerate the world’s transition to AI + Human Intelligence by making agentic AI systems transparent, observable, and reliable at scale.