Senior GPU and HPC Infrastructure Engineer - DGX Cloud jobs in United States
This job has closed.

NVIDIA · 4 days ago

Senior GPU and HPC Infrastructure Engineer - DGX Cloud

NVIDIA is hiring engineers to scale up its AI Infrastructure. The role involves contributing to the platform that automates GPU asset provisioning and lifecycle management, ensuring reliability and scalability of GPU assets, and collaborating with engineering teams to integrate software seamlessly from hardware to AI applications.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Growth Opportunities
H1B Sponsor Likely

Responsibilities

We have built a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers. You'll contribute to this platform to build end-to-end automation of datacenter operations, break/fix, and lifecycle management for large-scale Machine Learning systems.
Implement monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will harness multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry.
Work on software that manages NVLink topology across GPU clusters
Build automated test infrastructure that we use to qualify distributed systems for operation
Work with engineering teams across NVIDIA to ensure your software integrates seamlessly from the hardware all the way up to the AI training applications
You'll be constantly innovating, discovering new problems and their solutions

Qualifications

GPU knowledge, HPC experience, Linux administration, Distributed systems, Systems programming (Go), Systems programming (Python), Cluster management (Kubernetes), Cluster management (SLURM), Data structures, Algorithms, Communication skills, Problem-solving, Team collaboration

Required

Strong programming background
Knowledge of datacenter hardware, operations, and networking
Familiarity with software testing and deployment
Familiarity with distributed systems
Excellent communication and planning abilities
5+ years of software engineering experience on large-scale production systems
BS in Computer Science, Engineering, Physics, Mathematics, or another comparable degree, or equivalent experience
Expert-level knowledge of a systems programming language (Go, Python)
Solid understanding of data structures and algorithms
Expert-level knowledge of Linux system administration and management
Understanding of cluster management systems (Kubernetes, SLURM)
Understanding of performance, security and reliability in complex distributed systems
Familiarity with system-level architecture, data synchronization, fault tolerance, and state management

Preferred

Experience working with High Performance Computing (HPC), GPUs, and high-performance networking (RDMA, InfiniBand, RoCE)
Proficiency in architecting and managing large-scale distributed systems, independent of cloud providers
Deep knowledge of datacenter operations and GPU hardware
Hands-on experience working with RDMA networking
Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, SLURM)
Hands-on experience in Machine Learning Operations
Hands-on experience with Bright Cluster Manager
Hands-on experience developing and/or operating hardware fleet management systems
Proven operational excellence in designing and maintaining AI infrastructure

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase