NVIDIA · 18 hours ago

Site Reliability Engineer - Hardware Infrastructure

Santa Clara, CA

Full-time

Onsite

Senior Level, Lead/Staff

$168K/yr - $334K/yr

8+ years of experience

NVIDIA is a leading technology company specializing in graphics processing units and AI technology. The company is seeking a Site Reliability Engineer to define, develop, and support large-scale production systems, ensuring high efficiency and availability while collaborating with teams to improve system reliability and uptime.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Develop and support guidelines for incident management, planned maintenance, and blameless postmortems

Assist teams in responding to high severity incidents, driving root cause analysis, crafting high-quality postmortems, and developing post-incident corrective actions

Define reliability and supportability metrics, Service Level Objectives, and error budgets

Develop and drive the adoption of actionable, customer-centric monitoring and alerting

Apply automation and Generative AI/Agentic solutions to minimize manual and tedious activities and boost customer support

Guide teams on establishing sustainable on-call and operational standards

Qualifications

SRE principles, Infrastructure automation, Python, Observability platforms, Incident management, Generative AI solutions, Communication skills, Flexibility, Adaptability

Required

Degree in Computer Science or a related technical field involving coding, or equivalent experience

8+ years of experience in SRE, DevOps, or Production Engineering

Strong understanding of SRE principles, including incident management, error budgets, SLOs, and SLAs

Experience crafting and deploying systems that are fault-tolerant, performant, and supportable

Background in infrastructure automation

Experience running critical services in production

Experience with one or more of the following languages: Python, Go, Perl, or Ruby

Hands-on experience with observability platforms (e.g., Prometheus, Grafana)

Strong communication skills with the ability to convey technical concepts effectively to diverse audiences

Flexibility and adaptability working in a fast-paced environment with evolving requirements

Preferred

Expertise in establishing incident management and postmortem processes

Experience driving adoption of common tools and processes across diverse groups

Experience working with LLM/Generative AI/Agentic solutions to shorten mitigation time, lessen toil, and ensure Service Level Objectives are met

Hands-on expertise operating and scaling distributed systems with tight SLAs, ensuring high availability and performance

Benefits

Equity

Company

NVIDIA

Glassdoor rating: 4.6

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

Founded in 1993

Santa Clara, California, USA

10,001+ employees

https://www.nvidia.com

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (1877)

2024 (1355)

2023 (976)

2022 (835)

2021 (601)

2020 (529)

Funding

Current Stage

Public Company

Total Funding

$4.09B

Key Investors

ARPA-E, ARK Investment Management, SoftBank Vision Fund

2023-05-09: Grant · $5M

2022-08-09: Post-IPO Equity · $65M

2021-02-18: Post-IPO Equity

Leadership Team

Jensen Huang

Founder and CEO

Michael Kagan

Chief Technology Officer

Recent News

Sherwood News

Here’s a rundown of the AI-powered humanoid robots that tech companies want to be your new roommates

2026-01-16

FierceBiotech

Lilly, Nvidia tag on partnership with new AI co-innovation lab, $1B investment

2026-01-16

The Motley Fool

Will Nvidia Stock Fall Below $100 in 2026? Here's What History Has to Say.

2026-01-16

Company data provided by Crunchbase