Sr. Manager Sys/Test Validation Eng.



WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.




THE ROLE

AMD is seeking a Sr. Manager AI Systems Validation Architect to provide company‑level technical leadership in defining how AMD validates AI platforms as production‑ready, deployable data center solutions at hyperscale.

This role combines hands-on technical leadership with people management and is responsible for validation architecture across:

  • Server-level systems
  • Rack-level deployment environments
  • Cluster-scale AI infrastructures operating in real-world data centers

You will define validation approaches not just in lab environments, but in conditions that mirror hyperscale deployment realities—including:

  • Large-scale rack bring-up
  • Cluster provisioning and fleet validation
  • Data center integration and operations readiness

This role shapes validation strategy at the architectural level, ensuring alignment across CPU, GPU, networking, system architecture, firmware, and distributed software domains.

You will lead a small team of senior architects, while personally operating as a hands-on technical authority—driving system-level validation, influencing cross-functional teams, and guiding AMD AI platforms from design validation to production deployment readiness in data centers at scale.

KEY RESPONSIBILITIES

Manage and mentor a small team of highly technical validation engineers and architects.

Validation Architecture (Lab → Data Center Reality)

  • Architect validation strategies for:
    • Server-level AI platforms (CPU, GPU, memory, IO, firmware, system software)
    • Rack-level systems (power delivery, thermals, networking fabric, failure domains, multi-node integration)
    • Cluster-scale deployments (distributed systems behavior, orchestration, scheduling, resiliency, failover)
  • Define validation strategies that reflect hyperscale deployment environments, including:
    • Rack bring-up, provisioning, and fleet rollout
    • Cluster qualification under real-world workloads
    • Data center networking integration (fabric, storage, control planes)
    • Failure injection, fault isolation, and recovery validation

Data Center Deployment & Operational Validation

  • Drive validation methodologies for full-stack deployment readiness, including:
    • Hardware → firmware → OS → orchestration stack bring-up
    • Cluster provisioning and configuration management workflows
    • Integration with orchestration platforms (Kubernetes, Slurm, internal schedulers)
  • Partner with platform, SRE, and infrastructure teams to validate:
    • Cluster reliability, availability, and serviceability (RAS)
    • Scale-out behavior under load and failure conditions
    • Real-world operational scenarios (node failure, rack isolation, network partition, degraded performance)

Debug Leadership (System + Fleet + Production)

  • Lead and guide complex debug across:
    • Silicon interactions (CPU/GPU)
    • Firmware and system software
    • Networking and distributed systems
    • Data center deployment environments and clusters
  • Support and drive root cause analysis for issues seen in:
    • Lab environments
    • Pre-production clusters
    • Data center deployments / fleet-scale validation environments
  • Provide requirements for, and participate in the development of, frameworks for:
    • Fleet-level telemetry correlation
    • Failure reproduction at scale
    • Cross-layer debug (HW + SW + distributed system behavior)

SRE / Reliability Engineering Influence

  • Define validation approaches aligned with SRE principles, including:
    • Observability, metrics, and telemetry-driven validation
    • Chaos/failure testing and resiliency validation
    • SLA/SLO-driven validation criteria
  • Partner with SRE and infrastructure teams to ensure:
    • Systems are production-hardened before deployment
    • Validation includes operational readiness, not just functional correctness

Leadership & Cross-Org Influence

  • Lead and manage a small team of senior architects
  • Drive technical direction across silicon, platform, and software orgs
  • Act as a system-level authority bridging validation, infrastructure, and deployment teams
  • Represent validation architecture in executive and cross-functional forums

THE PERSON

The successful candidate is a recognized system-level leader with deep experience in data center systems, cluster-scale environments, and large-scale platform validation.

You are equally comfortable:

  • Managing a small team of highly technical validation engineers/architects
  • Driving architecture for validation at the rack and cluster level
  • Debugging complex multi-node failures in real environments
  • Working cross-functionally with hardware, software, and infrastructure teams
  • Operating in hyperscale data center environments (e.g., Meta, Google, AWS)

You bring:

  • A strong systems mindset
  • Experience with real-world deployments (not just lab validation)
  • The ability to scale impact through both leadership and technical depth

PREFERRED EXPERIENCE

  • Extensive experience in data center platforms, hyperscale infrastructure, or large distributed systems environments
  • Managing and mentoring small engineering teams
  • Direct experience with:
    • Rack bring-up and cluster deployment workflows
    • Fleet-scale validation or infrastructure qualification
    • Data center operations or infrastructure engineering
  • Strong background in:
    • Distributed systems, cluster orchestration, and scheduling
    • Networking (Ethernet, InfiniBand, RDMA, fabrics)
    • Observability systems (metrics/logging/tracing)
  • Experience in roles such as:
    • SRE, infrastructure engineering, production engineering, or data center validation
    • System-level or fleet-level debugging in production-like environments
  • Proven experience debugging issues across:
    • Hardware (CPU/GPU/platform)
    • Firmware / BIOS / BMC
    • OS / drivers / distributed software stacks
    • Cluster or data center environments
  • Experience with:
    • Failure injection, chaos testing, or reliability validation
    • Large-scale workload validation (AI/ML training, HPC, distributed compute)
  • Hands-on skills in:
    • Scripting, automation, and test tooling
    • Data-driven debugging and telemetry analysis
  • Experience influencing large organizations without direct authority
  • Strong ability to communicate across:
    • Engineering teams
    • Infrastructure/ops teams
    • Executive leadership

ACADEMIC CREDENTIALS

  • Bachelor’s degree in Electrical Engineering, Computer Engineering, or Computer Science
  • Advanced degree or equivalent industry experience preferred


This role is not eligible for visa sponsorship.




Benefits offered are described in AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.

This posting is for an existing vacancy.
