Data Center Site Lead
Oracle Cloud Infrastructure (OCI) is building the next-generation cloud platform that operates at hyperscale across a rapidly expanding global footprint. OCI's mission is to provide customers with high-performance, highly available, and secure cloud infrastructure services. As part of our continued growth, we are seeking an experienced Data Center Site Lead to oversee day-to-day operations, infrastructure deployments, and colocation partner management within our mission-critical data center environments.
Role Overview
The Data Center Site Lead is responsible for leading operational excellence across OCI data center facilities, ensuring high availability, safety, and performance of critical infrastructure. This role will oversee rack deployment activities, infrastructure commissioning, environmental monitoring, and operational governance with colocation providers. The successful candidate will bring hands-on experience from hyperscale cloud environments and possess a strong understanding of both mechanical and electrical systems supporting modern data centers, including liquid-cooled deployments.
This position requires a highly collaborative leader capable of coordinating cross-functional teams, driving operational rigor, and ensuring adherence to service level agreements (SLAs) and operational standards.
Key Responsibilities
Data Center Operations
- Lead day-to-day operations of OCI data center facilities to ensure maximum uptime, reliability, and operational efficiency.
- Serve as the primary site operational lead for mission-critical infrastructure and customer-impacting events.
- Drive operational readiness and continuous improvement initiatives across the site.
- Oversee server rack deployments, hardware installations, and capacity expansion projects.
- Coordinate with internal engineering, network, logistics, and deployment teams to ensure timely execution of infrastructure rollouts.
- Support implementation and operational management of liquid-cooled data halls and associated cooling infrastructure.
- Support commissioning and acceptance testing of electrical and mechanical infrastructure, including:
  - UPS systems
  - Switchgear
  - Power distribution systems
  - Generators
  - Cooling systems
  - CRAH/CRAC units
  - Liquid cooling systems
  - Building Management Systems (BMS)
- Validate operational readiness prior to production handover.
- Monitor and manage critical environmental parameters, including:
  - Temperature
  - Humidity
  - Airflow
  - Power utilization
  - Cooling performance
  - Water leak detection systems
- Ensure compliance with OCI operational standards, safety requirements, and regulatory obligations.
- Drive root cause analysis and corrective actions for environmental excursions or operational anomalies.
- Act as the primary operational interface with colocation providers.
- Conduct regular operational governance meetings and service reviews.
- Monitor and enforce contractual SLA adherence and service performance metrics.
- Escalate and resolve facility-related issues impacting operations.
- Review maintenance activities, change management plans, and risk assessments with providers.
- Lead operational bridge calls during incidents and critical events.
- Coordinate cross-functional response teams to restore services and mitigate risks.
- Ensure proper execution of change management processes and operational procedures.
- Drive post-incident reviews and corrective action tracking.
- Provide leadership and guidance to site operations personnel and supporting vendors.
- Collaborate with global operations, engineering, network, security, and capacity planning teams.
- Develop and maintain site operating procedures, runbooks, and operational documentation.
Qualifications
- Bachelor's degree in Engineering, Data Center Operations, Facilities Management, or a related technical discipline, or equivalent practical experience.
- 8+ years of experience in data center operations, facilities engineering, or critical environment management.
- Prior experience working within a hyperscale cloud provider environment (e.g., Oracle Cloud, AWS, Microsoft Azure, Google Cloud, Meta, or similar).
- Demonstrated experience operating and supporting liquid-cooled data center environments.
- Experience managing rack deployment programs and large-scale hardware installations.
- Strong understanding of critical electrical and mechanical systems supporting data centers.
- Experience supporting commissioning, testing, and handover of data center infrastructure.
- Experience managing relationships with colocation providers and external vendors.
- Proven experience conducting operational reviews, governance meetings, and SLA performance assessments.
- Strong incident management and operational escalation experience.
- Excellent communication, stakeholder management, and leadership skills.
Preferred Qualifications
- Experience operating large-scale AI, HPC, or GPU-intensive infrastructure environments.
- Knowledge of data center monitoring systems, BMS, DCIM, and environmental management platforms.
- Familiarity with ITIL-based operational processes.
- Project management experience supporting capacity expansion and infrastructure programs.
- Data center certifications such as:
  - CDCP
  - CDCS
  - DCEP
  - Uptime Institute certifications
  - Relevant electrical or mechanical engineering certifications
- Operational Excellence
- Critical Infrastructure Management
- Hyperscale Data Center Operations
- Liquid Cooling Technologies
- Vendor & Colocation Management
- Incident Command & Escalation Management
- Commissioning & Infrastructure Readiness
- Service Level Governance
- Leadership & Team Development
- Cross-Functional Collaboration
At Oracle Cloud Infrastructure, you will play a critical role in building and operating one of the world's fastest-growing cloud platforms. You'll work alongside industry-leading experts, influence the design and operation of next-generation data centers, and contribute directly to OCI's global expansion and innovation initiatives.
Required Technical Skills & Expertise
Critical Electrical Infrastructure
- Strong understanding of end-to-end data center power train architecture, including:
  - Utility power feeds
  - Substations and transformers
  - Medium- and low-voltage switchgear
  - Automatic Transfer Switches (ATS)
  - Static Transfer Switches (STS)
  - Uninterruptible Power Supply (UPS) systems
  - Power Distribution Units (PDUs)
  - Remote Power Panels (RPPs)
  - Busway systems
  - Generator systems and fuel infrastructure
- Ability to assess power capacity, redundancy models (N, N+1, 2N), and operational risk.
Mechanical & Cooling Systems
- Deep knowledge of data center mechanical systems, including:
  - Chillers
  - Cooling towers
  - CRAH/CRAC units
  - Direct-to-chip liquid cooling systems
  - Coolant Distribution Unit (CDU) operations
  - Heat rejection systems
  - Water treatment and leak detection systems
  - Building Management Systems (BMS)
- Experience troubleshooting thermal performance and optimizing cooling efficiency in high-density environments.
IT Systems & Hardware Operations
- Experience supporting hyperscale server deployment and lifecycle management.
- Strong understanding of:
  - Server hardware architecture
  - Storage systems
  - RAID configurations and storage resiliency concepts
  - Firmware and hardware maintenance procedures
  - Asset lifecycle management
  - Network rack integration and structured cabling practices
- Familiarity with hardware diagnostics, break-fix processes, and operational readiness testing.
Industrial Controls & Monitoring Systems
- Experience with data center monitoring and automation platforms, including:
  - DCIM platforms
  - BMS and EPMS systems
  - Environmental monitoring systems
- Understanding of industrial communication protocols such as:
  - Modbus TCP/IP
  - Modbus RTU
  - SNMP
  - BACnet
  - OPC-based monitoring architectures
- Ability to interpret telemetry, alarms, trends, and infrastructure performance metrics.
Vendor & Colocation Ecosystem Management
- Strong understanding of the data center vendor landscape across:
  - Electrical infrastructure providers
  - HVAC and liquid cooling manufacturers
  - Power systems vendors
  - Monitoring and controls platforms
- Experience working with leading OEMs and service providers such as Schneider Electric, Vertiv, Eaton, Siemens, ABB, Cummins, Caterpillar, Trane, Johnson Controls, Stulz, Carrier, and equivalent industry vendors.
- Ability to coordinate maintenance, commissioning, warranty support, and escalation management across multiple vendors and service partners.
Operational Governance & Service Management
- Experience leading operational reviews with colocation providers and service partners.
- Strong understanding of:
  - SLA and KPI management
  - Change management processes
  - Preventive and corrective maintenance programs
  - Incident management and root cause analysis (RCA)
  - Risk assessments and operational readiness reviews
- Proven ability to lead technical bridge calls and coordinate cross-functional response teams during critical incidents.
Additional Preferred Qualifications
- Experience supporting AI/HPC clusters and GPU-based infrastructure deployments.
- Knowledge of ASHRAE thermal guidelines and modern liquid cooling standards.
- Familiarity with energy efficiency metrics including PUE, WUE, and cooling optimization strategies.
- Experience with commissioning methodologies and Integrated Systems Testing (IST).
- Understanding of sustainability initiatives and energy management programs within hyperscale data centers.
Career Level - IC4