Platform Engineer

About HUD

HUD is building infrastructure to create RL training data and evals for frontier AI agents, along with the HUD marketplace for selling them to frontier labs. Our platform is used by frontier labs, Fortune 500 companies, and startups. We’ve raised $16M from top VCs and were YC W25.

About the role

We’re looking for a platform engineer who can own the reliability, scale, performance, and developer experience of HUD’s core infrastructure and backend systems.

This is not a pure infrastructure role. The right person has strong production infra experience, but also thinks like a backend engineer: they can reason about service architecture, queues, databases, APIs, deployment safety, performance bottlenecks, and how product requirements translate into resilient systems. You’ll work across AWS, Kubernetes, Terraform, CI/CD, observability, and backend services to make HUD faster, more reliable, cheaper to run, and easier for engineers to build on.

Responsibilities

  • Own production uptime, latency, provisioning speed, infrastructure cost, and incident response for core platform services

  • Build and maintain AWS infrastructure with Terraform, Kubernetes/EKS, Helm, Docker, EC2, CodeBuild, ECR, S3, IAM, networking, and secrets management

  • Design and improve backend and platform systems for scale, including capacity planning, autoscaling, queueing, backpressure, cleanup jobs, retries, and rollback paths

  • Define and improve dashboards, alerts, logs, traces, SLOs, runbooks, and on-call workflows so failures are detected, debugged, and resolved quickly

  • Build reliable CI/CD, release automation, environment management, and deployment workflows that improve developer productivity and reduce production risk

  • Write clean, maintainable code where needed to automate systems, improve backend services, and create internal tooling

Experience

You may be a good fit if you:

  • Have owned production cloud infrastructure for a high-availability, user-facing platform, with responsibility for uptime, performance, deployment safety, and cost

  • Have deep experience with AWS infrastructure and containerized systems; experience with tools like Terraform, Kubernetes/EKS, Docker, EC2, CodeBuild, ECR, S3, IAM, load balancers, networking, and secrets management is strongly preferred

  • Have built or operated CI/CD, environment management, release automation, observability, alerting, and incident response systems

  • Have strong backend engineering judgment and can reason about service architecture, APIs, databases, async systems, queues, scaling limits, and production failure modes

  • Can write clean, maintainable code and apply strong software engineering judgment across product architecture, infrastructure, backend systems, and developer workflows

Strong candidates may also have:

  • Experience operating infrastructure for data-heavy, ML/AI, workflow, marketplace, developer-tools, or enterprise platforms

  • Experience designing systems for bursty workloads, long-running jobs, sandboxed execution, distributed workers, or high-concurrency services

  • Experience reducing cloud spend through better architecture, autoscaling, workload placement, caching, cleanup systems, or observability

  • Experience building internal platforms or tools that make engineers faster without hiding too much complexity

We prioritize technical aptitude, ownership, and learning potential over years of experience.

Team & company details

  • Team size: ~15 people currently, mostly full-time in-person, with some remote.

  • Our team: Includes 4 International Olympiad medalists (IOI, ILO, IPhO), serial AI startup founders, and researchers with publications at ICLR, NeurIPS, etc.

  • Company stage: We have 8 figures in funding and high revenue growth. We’re scaling profitably and quickly to meet very strong demand.

Logistics

  • Employment: Full-time.

  • Location: On-site in the San Francisco Bay Area.

  • Visa sponsorship: We provide relocation and visa support for strong full-time candidates moving to the US.

  • Timeline: Applications are rolling. The process is 2 technical interviews and a 1-week work trial.

What we offer

  • Competitive compensation based on experience and location

  • 100% covered top-of-the-line medical, dental, and vision from Blue Shield of CA

  • Lunch and dinner when you’re in the office

  • Company-wide holiday break (Christmas Eve to New Year’s Day) on top of PTO and paid holidays

  • Other perks including an Equinox membership, 401k, and commuter benefits

  • Unlimited* access to tokens for ChatGPT, Claude Code, Cursor, etc. *By unlimited, we mean no one on our token usage leaderboard has ever hit a limit. So we have no idea what the limit is.

Due to high volume, we may not actively respond to every application, but feel free to contact us at [email removed] or elsewhere if we missed your application!