Nebius · 9 hours ago
Data Center Site Manager
Nebius is leading a new era in cloud computing to serve the global AI economy. The Data Center Site Manager will own end-to-end reliability, safety, capacity, and performance for one of the flagship U.S. sites, leading a multi-disciplinary operations team to ensure world-class availability and cost efficiency.
AI Infrastructure · Cloud Infrastructure · GPU · IaaS · PaaS
Responsibilities
Own the site 24/7: deliver continuous availability across power, cooling, structured cabling, network, security, and DCIM—meeting or beating global SLAs
Build and lead the team: hire, mentor, and develop managers/technicians; run staffing models, shift coverage, and on‑call rotations that scale
Be the incident commander: lead major events end‑to‑end—triage, communications, executive briefings, RCA, and durable corrective actions
Drive reliability engineering: implement RCM, predictive maintenance, QA/QC, 5S, and Lean/continuous improvement to cut MTTR and raise MTBF
Deliver capacity on time: plan and execute expansions/retrofits; commission MEP systems with Design/Construction; achieve flawless change control (MOP/SOP/EOP)
Scale tooling & automation: mature DCIM/BMS/EPMS, monitoring/alerting, work management (Jira/ServiceNow), knowledge base (Confluence), and light scripting/SQL for telemetry and workflow automation
Run a metrics‑first operation: publish dashboards and KPIs (availability, PUE, MTBF/MTTR, work compliance, safety) and use them to drive decisions
Partner across functions: work with Cloud/Compute, Network, Security, and Capacity Planning to optimize performance, cost, and resiliency across the fleet
Manage vendors & colos: own contracts, SLAs, and execution for rack deliveries, PDUs, fiber/copper, and lifecycle PMs; validate colo topology and compliance
Raise the safety bar: enforce a zero‑injury EHS culture; conduct drills/audits for life safety, physical security, and data protection
Forecast and budget: build data‑backed plans for power, spares, headcount, and projects; track OpEx/CapEx with rigor
Qualifications
Required
Associate's degree or trade certification in Electrical/Mechanical/Industrial Engineering (or equivalent experience)
10+ years in electrical/mechanical/HVAC/controls within industrial/commercial settings, with 5+ years specifically in data center or mission-critical facilities
Team leadership experience in 24/7 sites (managing leads/techs, vendors, and on-call operations)
Deep, hands-on knowledge of UPS/generators/switchgear, chillers/CRAC/CRAH, fire detection/suppression, BMS/EPMS/DCIM, and structured cabling (copper & fiber)
Proven strength in incident management, RCA/Corrective Actions, change management, and vendor/contract oversight
Data-driven mindset with the ability to forecast resources and make analytics-backed decisions (Excel; SQL/scripting a plus)
Excellent written/verbal communication with comfort presenting to executives and guiding field teams during live events
Ability to travel up to ~30% and support after-hours escalations when needed
Preferred
Bachelor's degree in Electrical/Mechanical/Industrial Engineering, Engineering Management, or Reliability Engineering
Hyperscale/colo experience with reliability-centered maintenance, predictive analytics, and Lean/Six Sigma practices
Familiarity with Linux fundamentals, network equipment installation/troubleshooting, and fiber optics testing
Experience with Jira, Confluence, ServiceNow (or similar); strong SOP/MOP/EOP authorship
Certifications such as CDCP, DCM, PMP, OSHA-30, ITIL, or Uptime-aligned credentials
Benefits
Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
401(k) plan: up to 4% company match with immediate vesting.
Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
Remote work reimbursement: up to $85/month for mobile and internet.
Disability & life insurance: company-paid short-term, long-term and life insurance coverage.
Company
Nebius
The Nebius AI Cloud brings powerful full-stack infrastructure for AI developers and practitioners across startups, enterprises and science institutes to build and deploy generative AI applications and rapidly deliver scientific breakthroughs by training and running ML models within a secure, high-performance, and cost-optimized cloud environment.
Funding
Current Stage: Late Stage
Total Funding: $1.04B
2025-06-04 · Debt Financing · $1B
2025-05-15 · Grant · $45M
2024-12-02 · Seed
Company data provided by Crunchbase