LLM Reliability & Evaluation Engineer
ABOUT XENONSTACK
XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.

We deliver innovation through:
- Agentic Systems for AI Agents → akira.ai
- Vision AI Platform → xenonstack.ai
- Inference AI Infrastructure for Agentic Systems → nexastack.ai

Our mission is to accelerate the world's transition to AI + Human Intelligence by making AI agents reliable, explainable, and enterprise-ready.

THE OPPORTUNITY

We are seeking an LLM Reliability & Evaluation Engineer to ensure that large language models (LLMs) and agentic AI systems meet enterprise-grade standards of accuracy, safety, and trustworthiness.

This role focuses on evaluating, benchmarking, and stress-testing LLMs in real-world workflows, and on building frameworks for reliability, robustness, and continuous improvement. If you thrive at the intersection of AI research, applied testing, and responsible deployment, this is the role for you.

KEY RESPONSIBILITIES

Evaluation Frameworks
- Design and implement LLM evaluation pipelines covering accuracy, robustness, safety, and bias.
- Develop automated systems for benchmarking models on enterprise-relevant tasks.

Reliability Engineering
- Conduct stress tests, adversarial testing, and edge-case evaluations.
- Build tools to measure latency, consistency, and error recovery in multi-turn interactions.

Metrics & Monitoring
- Define KPIs such as factual accuracy, hallucination rate, toxicity, and compliance alignment.
- Establish real-time monitoring for drift, anomalies, and performance regressions.

Collaboration & Alignment
- Partner with ML engineers, product managers, and domain experts to align evaluation with business objectives.
- Work with Responsible AI teams to implement ethical, explainable, and compliant evaluation practices.

Continuous Improvement
- Feed insights from evaluation into fine-tuning, RLHF/RLAIF pipelines, and model selection.
- Maintain a central repository of test cases, benchmarks, and evaluation results.

Research & Innovation
- Stay current with state-of-the-art LLM evaluation techniques, from academic benchmarks to applied enterprise metrics.
- Explore automated evaluation using agentic test harnesses and synthetic data generation.

SKILLS & QUALIFICATIONS

Must-Have
- 3–6 years in AI/ML, NLP, or applied model evaluation.
- Strong understanding of LLM architectures, prompt engineering, and failure modes.
- Hands-on experience with evaluation frameworks (Eval harnesses, Ragas, OpenAI Evals, DeepEval).
- Proficiency in Python and libraries like LangChain, LangGraph, LlamaIndex, and Hugging Face.
- Experience with vector databases, RAG pipelines, and knowledge graph integration.
- Familiarity with bias/fairness testing and Responsible AI frameworks.

Good-to-Have
- Experience with reinforcement learning (RLHF, RLAIF) and reward modeling.
- Exposure to agentic evaluation frameworks (multi-agent stress testing, synthetic user simulators).
- Knowledge of compliance and safety requirements for BFSI, GRC, or SOC use cases.
- Contributions to open-source evaluation libraries or research papers.

WHY SHOULD YOU JOIN US?

XENONSTACK CULTURE: JOIN US & MAKE AN IMPACT!

At XenonStack, we believe in shaping the future of intelligent systems.

We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.

Our Cultural Values
- Agency: Be self-directed and proactive.
- Taste: Sweat the details and build with precision.
- Ownership: Take responsibility for outcomes.
- Mastery: Commit to continuous learning and growth.
- Impatience: Move fast and embrace progress.
- Customer Obsession