NVIDIA · 1 day ago
Datacenter Resiliency Architect - New College Grad 2025
NVIDIA is a leading company in AI and computing technology, seeking a Resiliency Architect to support the development and validation of GPU hardware and software resiliency features. The role involves architecting resiliency features, analyzing metrics, and collaborating with teams to enhance system reliability and performance in datacenters.
Artificial Intelligence (AI) · Consumer Electronics · GPU · Hardware · Software · Virtual Reality
Responsibilities
Architect hardware and software Resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter
Model and analyze RAS metrics like Failures in Time for permanent and transient errors, and Availability from GPU to Rack to Datacenter. Use models to identify gaps and drive RAS improvements
Collaborate with architects, unit designers and software engineers to ensure alignment of verification requirements
Develop and implement comprehensive architecture verification testplans for resiliency features
Execute Architecture Testplan by developing test content, working with Software and Architecture teams to enable, run, and debug tests on Architecture models. Support test debug on RTL, emulation, and silicon
Run simulations to analyze Architectural Vulnerability Factor and Liveness of on-die memory, flip-flops, and latches
Develop CUDA software diagnostic kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues
Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlist, RTL, architectural model, silicon and other environments
Qualifications
Required
Pursuing or recently completed a Master's or PhD degree in Computer Engineering, Electrical Engineering, or a closely related field, or equivalent experience
Familiarity with GPU and Networking Architectures, Computer Architecture basics (including caches, coherence, buses, direct memory access, etc.); Machine Learning/Deep Learning concepts
Proficiency in RAS concepts and in developing Architecture models
Scripting and automation with Python or similar
Proficiency in C/C++
Excellent interpersonal skills and ability to collaborate with on-site and remote teams
Strong debugging and analytical skills
Self-driven and results-oriented
Preferred
Experience with resiliency and datacenter RAS
Proficiency in Verilog/SystemVerilog RTL simulation and debug. Ability to set up test benches and integrate various components
Programming with CUDA
Benefits
Equity
Benefits
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is presented below for your
reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity
Company data provided by Crunchbase