NVIDIA · 12 hours ago
Site Reliability Engineer - Hardware Infrastructure
NVIDIA is a leading technology company that specializes in graphics processing units and AI technology. They are seeking a Site Reliability Engineer to define, develop, and support large-scale production systems, ensuring high efficiency and availability while collaborating with teams to enhance system reliability and uptime.
AI Infrastructure · Artificial Intelligence (AI) · Consumer Electronics · Foundational AI · GPU · Hardware · Software · Virtual Reality
Responsibilities
Develop and support guidelines for incident management, planned maintenance, and blameless postmortems
Assist teams in responding to high severity incidents, driving root cause analysis, crafting high-quality postmortems, and developing post-incident corrective actions
Define reliability and supportability metrics, Service Level Objectives, and error budgets
Develop and drive the adoption of actionable, customer-centric monitoring and alerting
Apply automation and Generative AI/Agentic solutions to minimize manual and tedious activities and boost customer support
Guide teams on establishing sustainable on-call and operational standards
Qualifications
Required
Degree in Computer Science or a related technical field involving coding, or equivalent experience
8+ years of experience in SRE, DevOps, or Production Engineering
Strong understanding of SRE principles, including incident management, error budgets, SLOs, and SLAs
Experience crafting and deploying systems that are fault-tolerant, performant, and supportable
Background with infrastructure automation
Experience running critical services in production
Experience with one or more of the following languages: Python, Go, Perl, Ruby
Hands-on experience with observability platforms (e.g., Prometheus, Grafana)
Strong communication skills with the ability to convey technical concepts effectively to diverse audiences
Flexibility and adaptability working in a fast-paced environment with evolving requirements
Preferred
Expertise in establishing incident management and postmortem processes
Experience driving adoption of common tools and processes across diverse groups
Experience working with LLM/Generative AI/Agentic solutions to shorten mitigation time, lessen toil, and ensure Service Level Objectives are met
Hands-on expertise operating and scaling distributed systems with tight SLAs, ensuring high availability and performance
Benefits
Equity
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data provided by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship; the highlighted segment represents job fields similar to this job]
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity
Company data provided by Crunchbase