Sr. Manager Sys/Test Validation Eng.

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE

AMD is seeking a Sr. Manager AI Systems Validation Architect to provide company‑level technical leadership in defining how AMD validates AI platforms as production‑ready, deployable data center solutions at hyperscale.

This role combines hands-on technical leadership with people leadership and is responsible for validation architecture across:

Server-level systems
Rack-level deployment environments
Cluster-scale AI infrastructures operating in real-world data centers

You will define validation approaches not just in lab environments, but in conditions that mirror hyperscale deployment realities—including:

Large-scale rack bring-up
Cluster provisioning and fleet validation
Data center integration and operations readiness

This role shapes validation strategy at the architectural level, ensuring alignment across CPU, GPU, networking, system architecture, firmware, and distributed software domains.

You will lead a small team of senior architects, while personally operating as a hands-on technical authority—driving system-level validation, influencing cross-functional teams, and guiding AMD AI platforms from design validation to production deployment readiness in data centers at scale.

KEY RESPONSIBILITIES

Manage and mentor a small team of highly technical validation engineers and architects

Validation Architecture (Lab → Data Center Reality)

Architect validation strategies for:

Server-level AI platforms (CPU, GPU, memory, IO, firmware, system software)
Rack-level systems (power delivery, thermals, networking fabric, failure domains, multi-node integration)
Cluster-scale deployments (distributed systems behavior, orchestration, scheduling, resiliency, failover)

Define validation strategies that reflect hyperscale deployment environments, including:

Rack bring-up, provisioning, and fleet rollout
Cluster qualification under real-world workloads
Data center networking integration (fabric, storage, control planes)
Failure injection, fault isolation, and recovery validation

Data Center Deployment & Operational Validation

Drive validation methodologies for full-stack deployment readiness, including:

Hardware → firmware → OS → orchestration stack bring-up
Cluster provisioning and configuration management workflows
Integration with orchestration platforms (Kubernetes, Slurm, internal schedulers)

Partner with platform, SRE, and infrastructure teams to validate:

Cluster reliability, availability, and serviceability (RAS)
Scale-out behavior under load and failure conditions
Real-world operational scenarios (node failure, rack isolation, network partition, degraded performance)

Debug Leadership (System + Fleet + Production)

Lead and guide complex debug across:

Silicon interactions (CPU/GPU)
Firmware and system software
Networking and distributed systems
Data center deployment environments and clusters

Support and drive root cause analysis for issues seen in:

Lab environments
Pre-production clusters
Data center deployments / fleet-scale validation environments

Help define requirements for, and participate in the development of, frameworks for:

Fleet-level telemetry correlation
Failure reproduction at scale
Cross-layer debug (HW + SW + distributed system behavior)

SRE / Reliability Engineering Influence

Define validation approaches aligned with SRE principles, including:

Observability, metrics, and telemetry-driven validation
Chaos/failure testing and resiliency validation
SLA/SLO-driven validation criteria

Partner with SRE and infrastructure teams to ensure:

Systems are production-hardened before deployment
Validation includes operational readiness, not just functional correctness

Leadership & Cross-Org Influence

Lead and manage a small team of senior architects
Drive technical direction across silicon, platform, and software orgs
Act as a system-level authority bridging validation, infrastructure, and deployment teams
Represent validation architecture in executive and cross-functional forums

THE PERSON

The successful candidate is a recognized system-level leader with deep experience in data center systems, cluster-scale environments, and large-scale platform validation.

You are equally comfortable:

Managing a small team of highly technical validation engineers/architects
Driving architecture for validation at the rack and cluster level
Debugging complex multi-node failures in real environments
Working cross-functionally with hardware, software, and infrastructure teams
Operating in hyperscale data centers (Meta, Google, AWS, etc.)

You bring:

A strong systems mindset
Experience with real-world deployments (not just lab validation)
The ability to scale impact through both leadership and technical depth

PREFERRED EXPERIENCE

Extensive experience in data center platforms, hyperscale infrastructure, or large distributed systems environments
Managing and mentoring small engineering teams
Direct experience with:

Rack bring-up and cluster deployment workflows
Fleet-scale validation or infrastructure qualification
Data center operations or infrastructure engineering

Strong background in:

Distributed systems, cluster orchestration, and scheduling
Networking (Ethernet, InfiniBand, RDMA, fabrics)
Observability systems (metrics/logging/tracing)

Experience in roles such as:

SRE, infrastructure engineering, production engineering, or data center validation
System-level or fleet-level debugging in production-like environments

Proven experience debugging issues across:

Hardware (CPU/GPU/platform)
Firmware / BIOS / BMC
OS / drivers / distributed software stacks
Cluster or data center environments

Experience with:

Failure injection, chaos testing, or reliability validation
Large-scale workload validation (AI/ML training, HPC, distributed compute)

Hands-on skills in:

Scripting, automation, and test tooling
Data-driven debugging and telemetry analysis

Experience influencing large organizations without direct authority
Strong ability to communicate across:

Engineering teams
Infrastructure/ops teams
Executive leadership

ACADEMIC CREDENTIALS

Bachelor’s degree in Electrical Engineering, Computer Engineering, or Computer Science
Advanced degree or equivalent industry experience preferred

This role is not eligible for visa sponsorship.

Benefits offered are described in AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.

This posting is for an existing vacancy.
