Senior AI Infrastructure & Networking Engineer
We are seeking a Senior AI Infrastructure & Networking Engineer to lead the architecture, deployment, and optimization of our next-generation AI Factory. In this role, you will build and scale high-density GPU supercomputing clusters (scaling to 512 nodes and beyond) based on NVIDIA Blackwell Ultra (B300) systems. You will bridge heavy physical infrastructure (liquid cooling and busbar power) and advanced logical fabrics, ensuring predictable, line-rate, lossless transport for large-scale generative AI training and reasoning workloads.
Key Responsibilities
- AI Fabric Architecture & Deployment: Design, build, and optimize high-throughput, ultra-low-latency East-West compute networks using NVIDIA Spectrum-X Ethernet platforms (Spectrum-4 ASICs) and/or NVIDIA Quantum-X800 InfiniBand switching.
- Performance Tuning for Lossless Networking: Configure and fine-tune critical Layer 2/3 lossless transport mechanisms, including RDMA over Converged Ethernet (RoCE v2), Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and Data Center Quantized Congestion Notification (DCQCN).
- Rail-Optimized Topologies: Implement and maintain non-blocking, multi-plane, full fat-tree network topologies mapped to 8-GPU server architectures to maximize collective communication performance via NCCL (NVIDIA Collective Communications Library).
- SmartNIC & DPU Management: Deploy and manage high-speed compute network interfaces, including ConnectX-8 SuperNICs (800 Gb/s) and BlueField-3 DPUs for isolated infrastructure management, storage acceleration, and multi-tenant security.
- Full-Stack Orchestration & Automation: Drive infrastructure-as-code deployments using Ansible and Terraform. Deploy and monitor the NVIDIA Network Operator on the Kubernetes clusters that orchestrate the workloads.
- Telemetry & Validation: Utilize deep network telemetry tools such as NVIDIA NetQ and "What Just Happened" (WJH) to stream real-time switch diagnostics. Conduct line-rate cluster benchmarking using ib_write_bw and ib_write_lat to eliminate physical layer bottlenecks.
- Cross-Functional Infrastructure Alignment: Collaborate closely with data center facility teams on high-density environment requirements (approximately 15–20 kW or more per rack, liquid-cooled rows, Coolant Distribution Units (CDUs), and Rear Door Heat Exchangers). Ensure operational verification aligns with international standards (e.g., IDCA G-Grade or Uptime Institute Tier standards).
Required Technical Skills & Qualifications
- Education: Bachelor’s or Master’s degree in Computer Science, Network Engineering, Systems Engineering, or a related technical discipline.
- AI Networking Expertise: Proven track record of configuring RoCE v2, adaptive routing, and traffic optimization specifically for machine learning/HPC workloads.
- Hardware Familiarity: Deep understanding of high-density scale-up and scale-out systems (NVIDIA HGX/DGX architectures, PCIe switching, OSFP/QSFP112 optical and copper assemblies).
- Software & Cluster Management: Experience with cluster deployment suites like NVIDIA Mission Control, Base Command Manager, Run:ai, or similar enterprise MLOps frameworks.
- Routing Protocols: Strong proficiency with advanced datacenter networking protocols, particularly eBGP IPv6 unnumbered underlays and EVPN/VXLAN overlays for multi-tenant isolation.
- Cabling & Layer 1 Validation: Experience managing complex structured fiber trunking (MPO-12/MPO-24 APC) and executing layer-1 diagnostics (ibdiagnet, iblinkinfo).
Preferred Certifications
- NVIDIA Certified Professional - AI Networking (NCP-AIN) (Highly Preferred)
- NVIDIA Certified Expert - Cloud End-to-End Fabric (NCE-CEF)
- Advanced networking tracks from major vendors (e.g., CCIE, JNCIE, or Nokia Service Routing Architect) combined with proven data center fabric experience.
What We Offer
- Opportunity to work with first-of-its-kind, world-class AI supercomputing technologies (NVIDIA Blackwell Ultra).
- High-impact role shaping the foundational architecture for enterprise generative AI and large-scale LLM initiatives.
- Competitive salary, comprehensive benefits package, and continuous learning paths for advanced AI operations certifications.