LLM Reliability & Evaluation Engineer

ABOUT XENONSTACK

XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.

We deliver innovation through:

  • Agentic Systems for AI Agents → akira.ai

  • Vision AI Platform → xenonstack.ai

  • Inference AI Infrastructure for Agentic Systems → nexastack.ai

Our mission is to accelerate the world’s transition to AI + Human Intelligence by making AI agents reliable, explainable, and enterprise-ready.


THE OPPORTUNITY

We are seeking an LLM Reliability & Evaluation Engineer to ensure that large language models (LLMs) and agentic AI systems meet enterprise-grade standards of accuracy, safety, and trustworthiness.

This role focuses on evaluating, benchmarking, and stress-testing LLMs in real-world workflows, and on building frameworks for reliability, robustness, and continuous improvement. If you thrive at the intersection of AI research, applied testing, and responsible deployment, this is the role for you.


KEY RESPONSIBILITIES
  • Evaluation Frameworks

    • Design and implement LLM evaluation pipelines covering accuracy, robustness, safety, and bias.

    • Develop automated systems for benchmarking models on enterprise-relevant tasks.

  • Reliability Engineering

    • Conduct stress tests, adversarial testing, and edge-case evaluations.

    • Build tools to measure latency, consistency, and error recovery in multi-turn interactions.

  • Metrics & Monitoring

    • Define KPIs such as factual accuracy, hallucination rate, toxicity, and compliance alignment.

    • Establish real-time monitoring for drift, anomalies, and performance regressions.

  • Collaboration & Alignment

    • Partner with ML engineers, product managers, and domain experts to align evaluation with business objectives.

    • Work with Responsible AI teams to implement ethical, explainable, and compliant evaluation practices.

  • Continuous Improvement

    • Feed insights from evaluation into fine-tuning, RLHF/RLAIF pipelines, and model selection.

    • Maintain a central repository of test cases, benchmarks, and evaluation results.

  • Research & Innovation

    • Stay current with state-of-the-art LLM evaluation techniques, from academic benchmarks to applied enterprise metrics.

    • Explore automated evaluation using agentic test harnesses and synthetic data generation.


SKILLS & QUALIFICATIONS

Must-Have

  • 3–6 years of experience in AI/ML, NLP, or applied model evaluation.

  • Strong understanding of LLM architectures, prompt engineering, and failure modes.

  • Hands-on experience with evaluation frameworks (Eval harnesses, Ragas, OpenAI Evals, DeepEval).

  • Proficiency in Python and libraries like LangChain, LangGraph, LlamaIndex, and Hugging Face.

  • Experience with vector databases, RAG pipelines, and knowledge graph integration.

  • Familiarity with bias/fairness testing and Responsible AI frameworks.

Good-to-Have

  • Experience with reinforcement learning (RLHF, RLAIF) and reward modeling.

  • Exposure to agentic evaluation frameworks (multi-agent stress testing, synthetic user simulators).

  • Knowledge of compliance and safety requirements for BFSI, GRC, or SOC use cases.

  • Contributions to open-source evaluation libraries or research papers.


WHY SHOULD YOU JOIN US?
1. Agentic AI Product Company
   Ensure reliability in cutting-edge AI platforms that are redefining enterprise adoption.

2. A Fast-Growing Category Leader
   Be part of one of the fastest-growing AI Foundries, powering Fortune 500 enterprises with trustworthy AI.

3. Career Mobility & Growth
   Grow into roles such as AI Systems Architect, Responsible AI Engineer, or Reliability Engineering Lead.

4. Global Exposure
   Work on enterprise-scale evaluation challenges across BFSI, Healthcare, Telecom, and GRC.

5. Create Real Impact
   Your evaluations will directly shape production-grade AI agents used in mission-critical systems.

6. Culture of Excellence
   Our values (Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession) empower you to innovate fearlessly.

7. Responsible AI First
   Join a company that prioritizes trustworthy, explainable, and compliant AI.


XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!

At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.

Our Cultural Values

  • Agency – Be self-directed and proactive.

  • Taste – Sweat the details and build with precision.

  • Ownership – Take responsibility for outcomes.

  • Mastery – Commit to continuous learning and growth.

  • Impatience – Move fast and embrace progress.

  • Customer Obsession – Always put the customer first.

Our Product Philosophy

  • Obsessed with Adoption – Making AI accessible, reliable, and enterprise-ready.

  • Obsessed with Simplicity – Turning complex evaluation challenges into seamless, automated frameworks.

Be part of our mission to accelerate the world’s transition to AI + Human Intelligence by making AI agents not just powerful, but trustworthy and reliable.