Site Reliability Engineer - Hardware Infrastructure jobs in United States

NVIDIA · 18 hours ago

Site Reliability Engineer - Hardware Infrastructure

NVIDIA is a leading technology company that specializes in graphics processing units and AI technology. They are seeking a Site Reliability Engineer to define, develop, and support large-scale production systems, ensuring high efficiency and availability while collaborating with teams to enhance system reliability and uptime.

AI Infrastructure · Artificial Intelligence (AI) · Consumer Electronics · Foundational AI · GPU · Hardware · Software · Virtual Reality
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Develop and support guidelines for incident management, planned maintenance, and blameless postmortems
Assist teams in responding to high severity incidents, driving root cause analysis, crafting high-quality postmortems, and developing post-incident corrective actions
Define reliability and supportability metrics, Service Level Objectives, and error budgets
Develop and drive the adoption of actionable, customer-centric monitoring and alerting
Apply automation and Generative AI/Agentic solutions to minimize manual and tedious activities and boost customer support
Guide teams on establishing sustainable on-call and operational standards

Qualifications

SRE principles · Infrastructure automation · Python · Observability platforms · Incident management · Generative AI solutions · Communication skills · Flexibility · Adaptability

Required

Degree in Computer Science or a related technical field involving coding, or equivalent experience
8+ years of experience in SRE, DevOps, or Production Engineering
Strong understanding of SRE principles, including incident management, error budgets, SLOs, and SLAs
Experience crafting and deploying systems that are fault-tolerant, performant, and supportable
Background with infrastructure automation
Experience running critical services in production
Experience in one or more of the following: Python, Go, Perl, or Ruby
Hands-on experience with observability platforms (e.g., Prometheus, Grafana)
Strong communication skills with the ability to convey technical concepts effectively to diverse audiences
Flexibility and adaptability working in a fast-paced environment with evolving requirements

Preferred

Expertise in establishing incident management and postmortem processes
Experience driving adoption of common tools and processes across diverse groups
Experience working with LLM/Generative AI/Agentic solutions to shorten mitigation time, lessen toil, and ensure Service Level Objectives are met
Hands-on expertise operating and scaling distributed systems with tight SLAs, ensuring high availability and performance

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E · ARK Investment Management · SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase