Senior Platform and EngOps Engineer - Cluster Operations jobs in United States

NVIDIA · 3 days ago

Senior Platform and EngOps Engineer - Cluster Operations

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. They are looking for highly motivated EngOps and Platform Engineers to boost execution efficiency while managing and maintaining large GPU clusters interconnected via NVLink and InfiniBand.

Artificial Intelligence (AI), Consumer Electronics, GPU, Hardware, Software, Virtual Reality
Growth Opportunities
H1B Sponsor Likely
Hiring Manager
Bella Yanovsky

Responsibilities

Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand
Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations
Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance
Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions
Collaborate effectively with dynamic Engineering and Product Teams across multiple time zones to align cluster operations with evolving project requirements

Qualifications

GPU cluster management, Ansible, Python, Linux fundamentals, Shell Scripting, High-performance applications, Networking technologies, Troubleshooting, Alerting tools, Collaboration

Required

BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience
5+ years of hands-on experience in deploying and administering clusters, servers, switches, and related infrastructure
Automation expert with hands-on skills in Ansible, Python, and Shell Scripting
Deep understanding of operating systems, computer networks, and high-performance applications
Proven ability to work effectively with developers and test engineers across different teams and time zones
Proficient with Linux fundamentals

Preferred

Familiarity with resource scheduling managers, preferably Slurm
Direct experience with industry-standard alerting tools and emergency response practices
Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters
Proficiency in crafting and implementing a robust metrics collection and alerting infrastructure
Proficiency in designing large-scale networking technologies and addressing the associated challenges

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase