Gruve · 22 hours ago
Director of GPU Fleet Operations
Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. They are seeking a Director of GPU Fleet Operations to manage the lifecycle, reliability, and performance of their global GPU fleet, while driving strategy and execution for hardware and infrastructure operations.
Artificial Intelligence (AI) · Software · Machine Learning
Responsibilities
Own operational readiness, uptime, and performance of the global GPU fleet
Define and implement operational standards across OEM platforms (NVIDIA, Cisco, Dell, Supermicro, and others), GPU servers (NVIDIA, AMD, XPUs), and high-speed networking (InfiniBand/RoCE)
Standardize operations across liquid- and air-cooled environments, colocation sites, and modular data centers
Establish global processes for provisioning, monitoring, maintenance, incident response, and lifecycle management
Build and manage the full hardware lifecycle from deployment through retirement, leveraging outsourced resources for remote site operations
Develop scalable processes for diagnostics, RMA coordination, spare-parts forecasting, and reliability engineering
Define and track fleet SLOs/SLAs including availability, MTTR, MTBF, and utilization
Build and lead a 24×7 global remote operations organization
Develop a remote-first model to manage distributed clusters
Implement standardized runbooks, escalation paths, and observability across hardware, performance, power, cooling, and environmental telemetry
Partner with Platform/DevOps teams to maintain cluster software stacks (Kubernetes, Slurm, Kubeflow)
Oversee GPU drivers, firmware, CUDA stack, and configuration automation
Own patching, upgrades, change management, and low-impact maintenance practices
Manage platform layers operating above Kubernetes, including agent infrastructure
Lead adoption of AI/ML for predictive failure detection, anomaly detection, alert triage, and automated remediation
Build toward an autonomous, self-healing GPU fleet through data-driven automation
Manage OEM and repair vendor relationships and enforce SLAs
Coordinate global field technicians and remote hands support
Partner with Customer Success and Capacity Planning teams to ensure GPU availability and performance
Support large-scale deployments, escalations, and on-premises customer installations
Hire and lead teams across hardware operations, reliability engineering, NOC, and automation engineering
Establish KPIs, dashboards, and operational reporting to support rapid growth
Qualifications
Required
10+ years of experience in infrastructure, data center, or cloud operations
5+ years managing distributed hardware fleets or large-scale compute environments
Experience operating GPU, HPC, or high-performance compute clusters
Proven experience leading 24×7 operations teams
Strong technical understanding of: GPU servers and accelerator infrastructure; high-speed networking (InfiniBand/RoCE); Linux systems and hardware troubleshooting; cluster orchestration (Kubernetes, Slurm, or similar); monitoring and observability platforms; hardware lifecycle management and RMA processes; incident response and SRE practices
Experience applying automation or AI-driven approaches such as AIOps, telemetry analytics, predictive maintenance, and self-healing workflows
Preferred
Experience working in GPU Cloud, Neo Cloud, or AI infrastructure environments
Familiarity with liquid-cooled data centers
Experience managing distributed edge or modular data center deployments
Background in Site Reliability Engineering, Reliability Engineering, or HPC operations
Experience building automation at hyperscale or in large-fleet environments
Demonstrated ability to scale global technical teams and operate effectively in fast-growth startup settings
Benefits
Performance Bonus
Equity
Company
Gruve
Gruve is a startup focused on transforming AI strategies into tangible outcomes for enterprises.
H1B Sponsorship
Gruve has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships: 2025 (5)
Funding
Current Stage: Late Stage
Total Funding: $87.5M
Key Investors: Xora Innovation, Mayfield Fund
2026-02-03 · Series A · $50M
2025-04-30 · Series A · $20M
2025-04-30 · Seed · $17.5M
Company data provided by Crunchbase