Gruve · 22 hours ago

Director of GPU Fleet Operations

Redwood City, CA

Full-time

Hybrid

Director/Executive

$245K/yr - $250K/yr

10+ years exp

Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. They are seeking a Director of GPU Fleet Operations to manage the lifecycle, reliability, and performance of their global GPU fleet, while driving strategy and execution for hardware and infrastructure operations.

Artificial Intelligence (AI) · Software · Machine Learning

H1B Sponsor Likely

Responsibilities

Own operational readiness, uptime, and performance of the global GPU fleet

Define and implement operational standards across OEM platforms (NVIDIA, Cisco, Dell, Supermicro, and others), GPU servers (NVIDIA, AMD, XPUs), and high-speed networking (InfiniBand/RoCE)

Standardize operations across liquid- and air-cooled environments, colocation sites, and modular data centers

Establish global processes for provisioning, monitoring, maintenance, incident response, and lifecycle management

Build and manage the full hardware lifecycle from deployment through retirement, leveraging outsourced resources for remote site operations

Develop scalable processes for diagnostics, RMA coordination, spare-parts forecasting, and reliability engineering

Define and track fleet SLOs/SLAs including availability, MTTR, MTBF, and utilization

Build and lead a 24×7 global remote operations organization

Develop a remote-first model to manage distributed clusters

Implement standardized runbooks, escalation paths, and observability across hardware, performance, power, cooling, and environmental telemetry

Partner with Platform/DevOps teams to maintain cluster software stacks (Kubernetes, Slurm, Kubeflow)

Oversee GPU drivers, firmware, CUDA stack, and configuration automation

Own patching, upgrades, change management, and low-impact maintenance practices

Manage platform layers operating above Kubernetes, including agent infrastructure

Lead adoption of AI/ML for predictive failure detection, anomaly detection, alert triage, and automated remediation

Build toward an autonomous, self-healing GPU fleet through data-driven automation

Manage OEM and repair vendor relationships and enforce SLAs

Coordinate global field technicians and remote hands support

Partner with Customer Success and Capacity Planning teams to ensure GPU availability and performance

Support large-scale deployments, escalations, and on-premises customer installations

Hire and lead teams across hardware operations, reliability engineering, NOC, and automation engineering

Establish KPIs, dashboards, and operational reporting to support rapid growth

Qualifications

GPU operations · High-performance computing · Cloud operations · Kubernetes · AI-driven automation · Linux systems · Incident response · Vendor management · Capacity planning · Team leadership

Required

10+ years of experience in infrastructure, data center, or cloud operations

5+ years managing distributed hardware fleets or large-scale compute environments

Experience operating GPU, HPC, or high-performance compute clusters

Proven experience leading 24×7 operations teams

Strong technical understanding of: GPU servers and accelerator infrastructure; high-speed networking (InfiniBand/RoCE); Linux systems and hardware troubleshooting; cluster orchestration (Kubernetes, Slurm, or similar); monitoring and observability platforms; hardware lifecycle management and RMA processes; incident response and SRE practices

Experience applying automation or AI-driven approaches such as AIOps, telemetry analytics, predictive maintenance, and self-healing workflows

Preferred

Experience working in GPU Cloud, Neo Cloud, or AI infrastructure environments

Familiarity with liquid-cooled data centers

Experience managing distributed edge or modular data center deployments

Background in Site Reliability Engineering, Reliability Engineering, or HPC operations

Experience building automation in hyperscale or large-fleet environments

Demonstrated ability to scale global technical teams and operate effectively in fast-growth startup settings

Benefits

Performance Bonus

Equity

Company

Gruve

Gruve is a startup focused on transforming AI strategies into tangible outcomes for enterprises.

Founded in 2024

San Francisco, California, USA

501-1000 employees

https://www.gruve.ai

H1B Sponsorship

Gruve has a track record of offering H1B sponsorships, though this does not guarantee sponsorship for this specific role. Additional information is provided below for reference (data from the US Department of Labor).

Total sponsorships by year: 2025 (5)

Funding

Current Stage: Late Stage

Total Funding: $87.5M

Key Investors: Xora Innovation, Mayfield Fund

2026-02-03 · Series A · $50M

2025-04-30 · Series A · $20M

2025-04-30 · Seed · $17.5M

Company data provided by Crunchbase