Software Development Manager (EC2 Nitro), EC2 Core Provisioning
Join our dynamic team, where we apply agentic and machine-learning solutions to one of the hardest problems in the fleet: returning broken servers to production when there is no deterministic signal of what is wrong. You will build a learning, agent-driven decision engine on top of fleet telemetry and repair history, and you will ship it as a production software service that operates across millions of servers in every region, serving every EC2 line of business from core servers to accelerators and UltraServers. Sentinel is a direct lever on unsellable capacity and on the cost of running the fleet, and we are evolving it from human-authored decision rules into a system that recovers capacity on its own.
We are looking for an experienced Software Development Manager (SDM) to lead this team. The ideal candidate has led teams, thoroughly understands the design, development, and debugging of large-scale distributed systems, and is excited to apply ML and agentic techniques to real hardware-recovery problems. In this role, you will work with a broad group of technical teams across hardware, firmware, vetting, and provisioning.
Key job responsibilities
- Lead and inspire a team of engineers, providing guidance, mentorship, and support to foster their professional growth.
- Own the recovery decision engine that returns broken servers to sellable capacity, driving down the unsellable rate and the time a host stays stuck. Take on the failures that have no deterministic signal, and evolve the engine from static, human-authored signatures into an agentic, ML-driven system that infers the right repair from fleet outcomes and improves with every recovery.
- Build and operate this as a production software service (reliable, secure, and observable) running across millions of servers in every region, not as a set of offline models or scripts.
- Debug complex, system-level, multi-component failures across hardware, firmware, BMC, and the provisioning and vetting stack, and turn that diagnosis into automated, repeatable recovery.
- Collaborate with hardware engineering, firmware, component owners, vetting, and provisioning teams to expand recovery coverage across platforms and drive failures upstream to their root cause so they stop recurring.
- Raise the bar on the safety of autonomous action on production-bound capacity, holding a high security and operational standard for a service that runs across all regions, including restricted environments.
- Champion best practices in software engineering, including code quality, testing, automation, and continuous integration and delivery (CI/CD).