Senior GPU Cluster Engineer jobs in United States

Vero · 1 week ago

Senior GPU Cluster Engineer

Vero is a well-resourced AI infrastructure startup working in close partnership with NVIDIA and other organizations. They are seeking a Senior GPU Cluster Engineer to take hands-on ownership of GPU platforms powering mission-critical AI workloads at scale.

Staffing & Recruiting

Responsibilities

Run day-to-day operations across large-scale GPU clusters, including GPU servers, high-speed networking, storage, and management systems
Handle node imaging, system bring-up, lifecycle updates, and ongoing health checks across GPU infrastructure
Execute GPU validation workflows such as health checks, NVLink verification, NCCL test jobs, and performance benchmarking
Monitor alerts, troubleshoot operational issues, and escalate complex incidents to senior engineers when needed
Support the deployment of new racks, clusters, and regional expansions in a structured and reliable way
Maintain accurate configuration records, inventory, and operational status across the GPU fleet
Improve reliability and consistency by contributing to monitoring, automation, runbooks, and operational processes

Qualifications

Linux system administration, GPU-based systems, GPU architecture fundamentals, Kubernetes, Slurm, GPU diagnostics, Structured processes, Provisioning workflows, Automated imaging, Node lifecycle management

Required

Degree in Computer Science, Engineering, or a related field
Strong experience in Linux system administration, troubleshooting, and performance fundamentals
Hands-on experience operating GPU-based systems, HPC environments, or large-scale distributed compute platforms
Understanding of GPU architecture fundamentals including topology, memory behaviour, and runtime tooling
Familiarity with provisioning workflows, automated imaging, and node lifecycle management
Experience with Kubernetes and Slurm
Experience with GPU diagnostics and tooling such as NCCL testing, NVLink health checks, and SMI utilities
Comfortable following structured processes in uptime-focused production environments

Preferred

Experience working with NVIDIA GPUs and CUDA-enabled systems

Benefits

Medical, dental, and vision insurance for the employee and family
Equity Scheme
Bonus
401(k) with a generous employer match
Company-paid Life Insurance
Flexible Spending Account
Mental Wellness Benefits
Flexible PTO

Company

Vero

We help founders and leaders build high-impact teams by connecting them with exceptional talent globally, with a focus on the US.

Funding

Current Stage
Early Stage
Company data provided by Crunchbase