Manager, Machine Learning Infrastructure - SIML
In this role you will lead a team responsible for building and operating infrastructure that enables large-scale data processing (terabytes and beyond) across domains such as image generation, large language models (LLMs), computer vision, natural language processing, human-computer interaction, and text recognition. This includes designing systems for dataset creation and management, ingesting annotated and inferred data, and delivering seamless access to high-quality data for ML researchers and engineers.
A key part of this role is driving systems that enable a deeper understanding of model behavior, such as failure mode analysis, evaluation pipelines, and benchmarking frameworks, to increase iteration velocity and improve model quality. You will work across the stack, tackling challenges ranging from low-level distributed systems and compute efficiency to stable, intuitive interfaces for internal users.
As a leader, you will partner closely with cross-functional teams including ML researchers, product teams, and platform engineering to define roadmaps, align priorities, and deliver impactful solutions. You will play a critical role in shaping how ML systems are developed, evaluated, and scaled from early experimentation to production.
Minimum Qualifications
Bachelor’s, Master’s, or Ph.D. in Computer Science, Computer Engineering, or a related field (or equivalent experience)
7+ years of software engineering experience, with 2+ years in a technical leadership or management role
Strong programming skills in one or more of: Python, Java, Go, C/C++
Solid understanding of machine learning fundamentals and ML system workflows
Proven experience in building and scaling distributed systems and backend infrastructure
Strong system design skills with expertise in at least one systems domain (e.g., data infrastructure, distributed systems, ML platforms)
Preferred Qualifications
Experience building infrastructure for ML workflows (data pipelines, training systems, evaluation frameworks, or deployment systems)
Domain experience in AI/ML, computer vision, NLP, or related fields
Experience working with large-scale datasets and compute-intensive systems
Experience improving developer productivity through tooling and platform abstractions
Ability to operate effectively in cross-functional, fast-paced environments with evolving requirements