NVIDIA · 1 day ago
Senior Datacenter Resiliency Architect
NVIDIA is a leading company in the field of artificial intelligence and high-performance computing. They are seeking a Senior Datacenter Resiliency Architect to develop and validate GPU hardware and software resiliency features, ensuring system reliability and performance in datacenters.
Responsibilities
Architect hardware and software Resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter
Model and analyze RAS metrics like Failures in Time for permanent and transient errors, and Availability from GPU to Rack to Datacenter. Use models to identify gaps and drive RAS improvements
Collaborate with architects, unit designers and software engineers to ensure alignment of verification requirements
Develop and implement comprehensive architecture verification testplans for resiliency features
Execute Architecture Testplan by developing test content, working with Software and Architecture teams to enable, run, and debug tests on Architecture models. Support test debug on RTL, emulation, and silicon
Run simulations to analyze Architectural Vulnerability Factor and Liveness of on-die memory, flip-flops, and latches
Develop CUDA software diagnostic kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues
Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlist, RTL, architectural model, silicon and other environments
Qualifications
Required
Master's or PhD in Computer Engineering, Electrical Engineering, or a closely related field, or equivalent experience
At least 5 years of relevant experience
Familiarity with GPU and networking architectures, computer architecture basics (including caches, coherence, buses, and direct memory access), and machine learning/deep learning concepts
Strong knowledge and industry expertise in GPU hardware architecture, RAS features, or both
Proficiency in developing Architecture models
Scripting and automation with Python or similar
Proficiency in C/C++
Excellent interpersonal skills and ability to collaborate with on-site and remote teams
Strong debugging and analytical skills
Self-driven and results-oriented
Preferred
Experience with resiliency and datacenter RAS
Proficiency in Verilog/SystemVerilog RTL simulation and debug. Ability to set up test benches and integrate various components
Programming with CUDA
Benefits
Equity
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown; it marks the job field similar to this job)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 Grant · $5M
2022-08-09 Post-IPO Equity · $65M
2021-02-18 Post-IPO Equity
Company data provided by Crunchbase