NVIDIA · 4 days ago
Senior GPU and HPC Infrastructure Engineer - DGX Cloud
NVIDIA is hiring engineers to scale up its AI infrastructure. The role involves contributing to the platform that automates GPU asset provisioning and lifecycle management, ensuring the reliability and scalability of GPU assets, and collaborating with engineering teams to integrate software seamlessly from hardware to AI applications.
AI Infrastructure · Artificial Intelligence (AI) · Consumer Electronics · Foundational AI · GPU · Hardware · Software · Virtual Reality
Responsibilities
We have built a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers. You'll contribute to this platform to build end-to-end automation of datacenter operations, break/fix, and lifecycle management for large-scale Machine Learning systems
Implement monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets. You will be harnessing multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry (see the illustrative sketch after this list)
Work on software that manages NVLink topology across GPU clusters
Build automated test infrastructure that we use to qualify distributed systems for operation
Work with engineering teams across NVIDIA to ensure your software integrates seamlessly from the hardware all the way up to the AI training applications
You'll be constantly innovating, discovering new problems and their solutions
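To give a concrete flavor of the health-management work described above, the following is a minimal, purely illustrative sketch of a per-node GPU health poller built on nvidia-smi telemetry. The query fields are standard nvidia-smi --query-gpu fields, but the temperature threshold, polling interval, and overall structure are assumptions made for illustration only; this is not a description of NVIDIA's actual platform.

#!/usr/bin/env python3
"""Illustrative sketch only: a minimal GPU health poller of the kind the
responsibilities above describe (GPU hardware diagnostics as one data stream).
Assumes nvidia-smi is on PATH; threshold and polling interval are hypothetical."""

import subprocess
import time

# Standard nvidia-smi --query-gpu fields.
QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu,memory.used"

# Hypothetical alert threshold in degrees Celsius (not taken from the posting).
TEMP_ALERT_C = 85


def _to_int(value: str) -> int:
    """Parse a numeric nvidia-smi CSV field, tolerating stray unit suffixes."""
    return int(value.strip().rstrip("%").split()[0])


def poll_gpus():
    """Return one dict per GPU parsed from nvidia-smi CSV output."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        text=True,
    )
    gpus = []
    for line in out.strip().splitlines():
        index, name, temp, util, mem = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "temperature_c": _to_int(temp),
            "utilization_pct": _to_int(util),
            "memory_used_mib": _to_int(mem),
        })
    return gpus


if __name__ == "__main__":
    # A real fleet-management system would stream this into cluster-wide
    # telemetry; here we simply poll and flag hot GPUs on one node.
    while True:
        for gpu in poll_gpus():
            status = "ALERT" if gpu["temperature_c"] >= TEMP_ALERT_C else "ok"
            print(f"[{status}] GPU{gpu['index']} {gpu['name']}: "
                  f"{gpu['temperature_c']}C, {gpu['utilization_pct']}% util, "
                  f"{gpu['memory_used_mib']} MiB used")
        time.sleep(30)

In practice, a platform at this scale would likely aggregate such signals through DCGM-based exporters and cluster-level telemetry pipelines rather than a per-node polling loop like this one.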
Qualifications
Required
Strong programming background
Knowledge of datacenter hardware, operations, and networking
Familiarity with software testing and deployment
Familiarity with distributed systems
Excellent communication and planning abilities
5+ years of software engineering experience on large-scale production systems
BS in Computer Science, Engineering, Physics, Mathematics, or a comparable field, or equivalent experience
Expert-level knowledge of a systems programming language (Go, Python)
Solid understanding of data structures and algorithms
Expert-level knowledge of Linux system administration and management
Understanding of cluster management systems (Kubernetes, SLURM)
Understanding of performance, security, and reliability in complex distributed systems
Familiarity with system-level architecture, data synchronization, fault tolerance, and state management
Preferred
Experience working with High Performance Computing (HPC), GPUs, and high-performance networking (RDMA, InfiniBand, RoCE)
Proficiency in architecting and managing large-scale distributed systems, independent of cloud providers
Deep knowledge of datacenter operations and GPU hardware
Hands-on experience working with RDMA networking
Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, SLURM)
Hands-on experience in Machine Learning Operations
Hands-on experience with Bright Cluster Manager
Hands-on experience developing and/or operating hardware fleet management systems
Proven operational excellence in designing and maintaining AI infrastructure
Benefits
Equity
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. The figures below are provided for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09: Grant · $5M
2022-08-09: Post-IPO Equity · $65M
2021-02-18: Post-IPO Equity
Company data provided by Crunchbase