Principal LLM Inference Engineer
At d-Matrix, we are focused on unleashing the potential of generative AI to power the transformation of technology. We are at the forefront of software and hardware innovation, pushing the boundaries of what is possible. Our culture is one of respect and collaboration.
We value humility and believe in direct communication. Our team is inclusive, and our differing perspectives allow for better solutions. We are seeking individuals who are passionate about tackling challenges and driven by execution. Ready to come find your playground? Together, we can help shape the endless possibilities of AI.
The d-Matrix Frontier Group sits at the leading edge of what’s possible with LLM inference on heterogeneous hardware. Our charter spans the full stack: from pathfinding emerging use cases and novel deployment patterns, to deep optimization of inference kernels, to building proof-of-concept systems that showcase d-Matrix’s unique computational fabric. We are an applied research and engineering team that moves fast, ships real systems, and works directly with product and hardware teams to shape the roadmap.
We build the tools, runtimes, and frameworks that let frontier AI models run efficiently and cost-effectively across heterogeneous deployments that combine d-Matrix silicon with CPUs, GPUs, and custom accelerators. Our work powers everything from benchmarking and evaluation pipelines to production-grade inference serving.
This Role
We are hiring end-to-end inference engineers who are comfortable going from a novel research idea to a deployed, optimized system. You will work at every layer of the inference stack — from kernel-level optimization to distributed orchestration to high-level serving APIs.
This role could be a great match for you if you:
• Have deep intuition for modern generative AI architectures and how to squeeze performance out of them at inference time.
• Are familiar with the internals of open-source inference frameworks (vLLM, SGLang, TensorRT-LLM, etc.) and can extend or replace them when needed.
• Enjoy pathfinding new use cases — exploring heterogeneous deployment topologies and building early-stage POCs that prove out new ideas.
• Are results-oriented with a strong bias toward action; you own problems end-to-end from prototype to optimization to handoff.
• Are energized by working at the intersection of novel hardware and frontier models, and want your work to directly influence how next-generation AI silicon is used.
• Value clear communication and thrive in a small, high-ownership team environment.
Responsibilities
• Identify and prototype emerging LLM inference use cases suited to heterogeneous hardware deployments.
• Build compelling proof-of-concept systems that demonstrate d-Matrix capabilities to customers, partners, and internal stakeholders.
• Develop and tune custom kernels and operator-level optimizations to maximize throughput and minimize latency.
• Drive quantization, sparsity, and batching strategies tailored to d-Matrix’s computational model.
• Build and maintain inference runtimes, serving frameworks, and evaluation tooling.
• Contribute to distributed inference systems: tensor/pipeline parallelism, disaggregated prefill/decode, KV-cache management.
• Work closely with hardware architects to provide firmware and compiler teams with actionable inference workload insights.
• Partner with product and business development to translate POCs into customer-facing demonstrations.
• Contribute to technical publications, whitepapers, and open-source projects that advance d-Matrix’s visibility.
Required Qualifications
• Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, and 10+ years of relevant engineering experience; or equivalent demonstrated experience.
• Master’s or PhD in Computer Science, Electrical Engineering, or a related field preferred, with 6+ years of relevant industry experience.
• Strong proficiency in Python and C/C++.
• Hands-on experience optimizing LLM inference — attention kernels, KV cache, batching strategies, quantization (INT8/FP8/INT4).
• Experience with at least one major inference framework (vLLM, SGLang, TensorRT-LLM, ONNX Runtime, or similar) at a contributor level.
• Familiarity with GPU kernel programming (CUDA/Triton) and performance profiling tools.
Preferred Qualifications
• Experience with heterogeneous compute deployments — scheduling inference workloads across dissimilar hardware (accelerators, CPUs, GPUs).
• Familiarity with custom silicon or ASIC-based inference (beyond GPU-only environments).
• Experience with distributed inference: tensor parallelism, pipeline parallelism, disaggregated serving.
• Contributions to open-source inference or ML systems projects.
• Experience with production inference serving at scale (latency SLOs, continuous batching, multi-model serving).
• Familiarity with speculative decoding, mixture-of-experts routing, or long-context serving techniques.
• Working familiarity with the material in the JAX Scaling Book or equivalent systems-level understanding of modern LLM training and inference.
Why d-Matrix Frontier Group
• Work on genuinely novel hardware: d-Matrix’s in-memory compute architecture opens up inference optimization problems that don’t exist anywhere else.
• End-to-end ownership from idea to deployed system, with a short feedback loop between your work and real hardware.
• Small, senior team with high autonomy and direct influence on product direction.
• Competitive compensation, equity, and benefits in Santa Clara, CA.
Equal Opportunity Employment Policy
d-Matrix is proud to be an equal opportunity workplace and affirmative action employer. We’re committed to fostering an inclusive environment where everyone feels welcomed and empowered to do their best work. We hire the best talent for our teams, regardless of race, religion, color, age, disability, sex, gender identity, sexual orientation, ancestry, genetic information, marital status, national origin, political affiliation, or veteran status. Our focus is on hiring teammates with humble expertise, kindness, dedication and a willingness to embrace challenges and learn together every day.
d-Matrix does not accept resumes or candidate submissions from external agencies. We appreciate the interest and effort of recruitment firms, but we kindly request that individuals interested in opportunities with d-Matrix apply directly through our official channels. This approach allows us to streamline our hiring processes and maintain a consistent and fair evaluation of all applicants. Thank you for your understanding and cooperation.