Machine Learning Engineer - AI & ML Evaluation Frameworks

In this role, you will architect and build large-scale evaluation frameworks to interrogate unimodal ML systems and multi-modal foundation models. Beyond infrastructure, you will lead deep-dive ML evaluations, performing failure analysis to uncover performance gaps, reasoning flaws, and edge cases. You will translate findings into actionable insights and work directly with algorithm teams to improve the safety and reliability of our health features. Your work will empower teams across Apple to rapidly evaluate multi-modal sensor fusion while upholding Apple's privacy standards.

Minimum Qualifications

BS in Computer Science, Machine Learning, Statistics, or a related field
3+ years of experience in ML Engineering or Applied ML
Strong experience evaluating supervised, unsupervised, and deep learning models, including LLMs
Proficiency in Python with the ability to write production-grade code (OOP, CI/CD, Git)
Hands-on experience in failure analysis, evaluating LLMs, and driving subsequent model improvements
Experience building data pipelines, inference frameworks, and automated evaluation systems
Strong communication skills to articulate complex technical concepts to technical and non-technical audiences

Preferred Qualifications

MS/PhD in Computer Science, Machine Learning, Statistics, or a related field
Experience evaluating LLMs or agentic systems (e.g., LLM-as-a-judge, RAG evaluation)
Experience with synthetic data generation and prompt engineering
Experience in parallel data processing (Spark, Kubernetes, Airflow) or privacy-preserving ML (Federated Learning)
Background in AI Safety, model interpretability, or adversarial testing
Interest in digital health and clinical rigor
