NVIDIA · 3 days ago

Senior Platform and EngOps Engineer - Cluster Operations

Santa Clara, CA

Full-time

Onsite

Senior Level

$168K/yr - $270K/yr

5+ years exp

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. They are looking for highly motivated EngOps and Platform Engineers to boost execution efficiency while managing and maintaining large GPU clusters interconnected via NVLink and InfiniBand.

Artificial Intelligence (AI), Consumer Electronics, GPU, Hardware, Software, Virtual Reality

Growth Opportunities

H1B Sponsor Likely

Hiring Manager

Bella Yanovsky

Responsibilities

Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand

Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations

Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance

Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions

Collaborate effectively with dynamic Engineering and Product Teams across multiple time zones to align cluster operations with evolving project requirements

Qualifications

GPU cluster management, Ansible, Python, Linux fundamentals, Shell Scripting, High-performance applications, Networking technologies, Troubleshooting, Alerting tools, Collaboration

Required

BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience

5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure

Automation expert with hands-on skills in Ansible, Python, and shell scripting

Deep understanding of operating systems, computer networks, and high-performance applications

Proven ability to work effectively with developers and test engineers across different teams and time zones

Proficient with Linux fundamentals

Preferred

Familiarity with resource scheduling managers, preferably Slurm

Direct experience with industry standard alerting tools and emergency response practices

Hands-on experience with GPU-focused hardware and software, such as DGX systems and Compute Clusters

Proficiency in crafting and implementing a robust metrics collection and alerting infrastructure

Proficiency in designing large-scale networks and addressing the associated challenges

Benefits

Equity

Benefits

Company

NVIDIA

Glassdoor: 4.6

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

Founded in 1993

Santa Clara, California, USA

10,001+ employees

https://www.nvidia.com

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (1877)

2024 (1355)

2023 (976)

2022 (835)

2021 (601)

2020 (529)

Funding

Current Stage

Public Company

Total Funding

$4.09B

Key Investors

ARPA-E, ARK Investment Management, SoftBank Vision Fund

2023-05-09: Grant · $5M

2022-08-09: Post-IPO Equity · $65M

2021-02-18: Post-IPO Equity

Leadership Team

Jensen Huang

Founder and CEO

Michael Kagan

Chief Technology Officer

Recent News

MarketScreener

Nvidia CEO Huang touts automotive AI at CES talk as competition mounts

2026-01-06

MarketScreener

Nvidia: Rubin-based products will be available from second half of 2026

2026-01-06

GlobeNewswire

NVIDIA Announces Alpamayo Family of Open-Source AI Models and Tools to Accelerate Safe, Reasoning-Based Autonomous Vehicle Development

2026-01-06

Company data provided by Crunchbase