Platform Engineer
About HUD
HUD is building infrastructure to create RL training data and evals for frontier AI agents, along with a marketplace for selling them to frontier labs. Our platform is used by frontier labs, Fortune 500 companies, and startups. We’ve raised $16M from top VCs and were part of YC W25.
About the role
We’re looking for a platform engineer who can own the reliability, scale, performance, and developer experience of HUD’s core infrastructure and backend systems.
This is not a pure infrastructure role. The right person has strong production infra experience, but also thinks like a backend engineer: they can reason about service architecture, queues, databases, APIs, deployment safety, performance bottlenecks, and how product requirements translate into resilient systems. You’ll work across AWS, Kubernetes, Terraform, CI/CD, observability, and backend services to make HUD faster, more reliable, cheaper to run, and easier for engineers to build on.
Responsibilities
- Own production uptime, latency, provisioning speed, infrastructure cost, and incident response for core platform services
- Build and maintain AWS infrastructure with Terraform, Kubernetes/EKS, Helm, Docker, EC2, CodeBuild, ECR, S3, IAM, networking, and secrets management
- Design and improve backend and platform systems for scale, including capacity planning, autoscaling, queueing, backpressure, cleanup jobs, retries, and rollback paths
- Define and improve dashboards, alerts, logs, traces, SLOs, runbooks, and on-call workflows so failures are detected, debugged, and resolved quickly
- Build reliable CI/CD, release automation, environment management, and deployment workflows that improve developer productivity and reduce production risk
- Write clean, maintainable code where needed to automate systems, improve backend services, and create internal tooling
Experience
You may be a good fit if you:
- Have owned production cloud infrastructure for a high-availability, user-facing platform, with responsibility for uptime, performance, deployment safety, and cost
- Have deep experience with AWS infrastructure and containerized systems; experience with tools like Terraform, Kubernetes/EKS, Docker, EC2, CodeBuild, ECR, S3, IAM, load balancers, networking, and secrets management is strongly preferred
- Have built or operated CI/CD, environment management, release automation, observability, alerting, and incident response systems
- Have strong backend engineering judgment and can reason about service architecture, APIs, databases, async systems, queues, scaling limits, and production failure modes
- Can write clean, maintainable code and apply strong software engineering judgment across product architecture, infrastructure, backend systems, and developer workflows
Strong candidates may also have:
- Experience operating infrastructure for data-heavy, ML/AI, workflow, marketplace, developer-tools, or enterprise platforms
- Experience designing systems for bursty workloads, long-running jobs, sandboxed execution, distributed workers, or high-concurrency services
- Experience reducing cloud spend through better architecture, autoscaling, workload placement, caching, cleanup systems, or observability
- Experience building internal platforms or tools that make engineers faster without hiding too much complexity
We prioritize technical aptitude, ownership, and learning potential over years of experience.
Team & company details
- Team size: ~15 people currently, mostly full-time and in-person, with some remote.
- Our team: Includes 4 International Olympiad medalists (IOI, ILO, IPhO), serial AI startup founders, and researchers with publications at ICLR, NeurIPS, etc.
- Company stage: We have 8 figures in funding and high revenue growth. We’re scaling profitably and quickly to meet very strong demand.
Logistics
- Employment: Full-time.
- Location: On-site in the San Francisco Bay Area.
- Visa sponsorship: We support relocation to the US and visas for strong full-time candidates.
- Timeline: Applications are rolling. The process is 2 technical interviews and a 1-week work trial.
What we offer
- Competitive compensation based on experience and location
- 100% covered top-of-the-line medical, dental, and vision from Blue Shield of CA
- Lunch and dinner when you’re in the office
- Company-wide holiday break (Christmas Eve to New Year’s Day) on top of PTO and paid holidays
- Other perks including an Equinox membership, 401k, and commuter benefits
- Unlimited* access to tokens for ChatGPT, Claude Code, Cursor, etc.
*By unlimited, we mean no one on our token usage leaderboard has ever hit a limit. So we have no idea what the limit is.
Due to high volume, we may not actively respond to every application, but feel free to contact us at [[email removed]](mailto:[email removed]) or elsewhere if we missed your application!