ML Compute Efficiency Automation Engineer, Infrastructure & Planning

- Govern compute as code. Build the systems of record for resource requests, allocations, and utilization, accurate and at scale, so leadership can trust the numbers.
- Hunt down ML inefficiency. Dig into inference and training workloads across GPUs, TPUs, and custom Apple Silicon, find where compute is wasted, trace it to a cause, and drive the fix.
- Work the real optimization problems: scheduling, capacity allocation, and serving cost, alongside the engineers who own those systems.
- Get rid of the toil. Replace the time-sink workflows (triage, reporting, reconciliation) with systems that handle the routine and pull a person in only when judgment matters. Drive manual escalations toward zero instead of standing up a tiered on-call org.
- Make the data useful. Build the telemetry, schemas, and anomaly detection that surface efficiency and cost opportunities, then wire them into tooling that acts rather than just filing a report.
- Rebuild what breaks at scale. When a process buckles under Apple-scale ML demand, re-architect it so it grows with usage instead of headcount.
- Make a lasting impact. Turn what you build into reusable tooling so the rest of the team benefits without coming back to you each time.

Minimum Qualifications

- BS in Computer Science, Computer Engineering, or equivalent practical experience.
- 6 or more years building production software, automation, tooling, or data and infrastructure systems.
- A problem solver who builds first. You have designed things from scratch to wipe out manual work or get past a scale ceiling, and you can show us something you built.
- Fluent with AI tooling. Coding assistants are part of how you already work, not something you read about.
- Strong programming skills, in Python or similar, for automation, pipelines, and tooling.
- SQL, plus dashboards or data products in something like Tableau, Looker, or Grafana.
- Experience designing data models or telemetry schemas for infrastructure, capacity, or utilization data.
- Experience running complex systems in a large-scale compute, cloud, or infrastructure environment.
- Experience knowing where not to automate, and how to guardrail systems that act on their own.
- Strong cross-team collaborator who moves work forward through influence rather than authority, and is comfortable owning systems others rely on daily.

Preferred Qualifications

- Production experience shipping automated or autonomous workflows
- Understanding of ML training and inference infrastructure, GPU and TPU utilization, training throughput, scheduling efficiency, and foundation model serving
- Experience building automated alerting or anomaly detection for infrastructure metrics
- Experience with FinOps, capacity planning, cloud cost management, or IT governance
- Knowledge of Django/Postgres
- Love for open-ended "go figure it out and build it" projects
