Mgr, Engineering Program Management, AI Platforms & Infrastructure
We are looking for an experienced Engineering Program Management (EPM) Manager to lead strategy, execution, and delivery across our AI/ML platform and infrastructure programs. In this role, you will drive cross-functional initiatives spanning Apple’s massive-scale GPU/TPU compute infrastructure, Foundation Model inference platforms, and hybrid-cloud AI systems. You will partner closely with engineering and operations leaders to translate complex technical requirements into actionable roadmaps. Crucially, you will be responsible for growing and scaling a high-performing EPM team to meet the rapidly expanding demands of Apple’s generative AI and machine learning platforms.
Minimum Qualifications
10+ years of experience in product or program management, with at least 3 years in a people management or lead EPM role.
Proven experience building and scaling teams, with the organizational savvy to expand team scope and influence across a highly matrixed environment.
Extensive experience managing strategic relationships with top-tier cloud vendors and external partners, including infrastructure planning, contract alignment, and SLA enforcement.
Strong strategic thinking with the ability to balance long-term platform roadmap priorities against near-term inference and training execution demands.
Track record of delivering massive-scale cost optimization and operational efficiency programs in hybrid-cloud environments.
Excellent communication and stakeholder management skills, with the ability to translate complex technical infrastructure concepts for both deeply technical engineering teams and executive audiences.
Experience in multi-tenant, high-performance compute environments running large-scale Foundation Models or similar ML workloads.
BS/MS in EE/CS/CE or equivalent.
Preferred Qualifications
Deep technical background in AI/ML infrastructure, cloud operations, or distributed compute platforms, with direct experience in GPU/TPU capacity management and provisioning.
Familiarity with large-scale distributed training frameworks (e.g., PyTorch, Megatron-LM, JAX) and their infrastructure implications at scale.
Familiarity with FinOps practices in large-scale GPU/TPU environments.
Experience navigating large-scale organizational change and team restructuring.