Senior Datacenter Resiliency Architect jobs in United States

NVIDIA · 2 days ago

Senior Datacenter Resiliency Architect

NVIDIA is a leading company in the field of artificial intelligence and high-performance computing. They are seeking a Senior Datacenter Resiliency Architect to develop and validate GPU hardware and software resiliency features, ensuring system reliability and performance in datacenters.

Artificial Intelligence (AI) · Consumer Electronics · GPU · Hardware · Software · Virtual Reality
Growth Opportunities
H1B Sponsor Likely
Hiring Manager: Joshua Hasten

Responsibilities

Architect hardware and software Resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter
Model and analyze RAS metrics like Failures in Time for permanent and transient errors, and Availability from GPU to Rack to Datacenter. Use models to identify gaps and drive RAS improvements
Collaborate with architects, unit designers and software engineers to ensure alignment of verification requirements
Develop and implement comprehensive architecture verification testplans for resiliency features
Execute Architecture Testplan by developing test content, working with Software and Architecture teams to enable, run, and debug tests on Architecture models. Support test debug on RTL, emulation, and silicon
Run simulations to analyze Architectural Vulnerability Factor and Liveness of on-die memory, flip-flops, and latches
Develop CUDA software diagnostic kernels to run on clusters of NVIDIA GPUs and identify potential hardware issues
Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlist, RTL, architectural model, silicon and other environments

Qualifications

GPU hardware architecture · RAS features · Architecture models · C/C++ · Python scripting · Verilog/System Verilog · Machine Learning concepts · Debugging skills · Analytical skills · Interpersonal skills · Self-driven

Required

Master's or PhD in Computer Engineering, Electrical Engineering, or a closely related field, or equivalent experience
At least 5 years of relevant experience
Familiarity with GPU and networking architectures, computer architecture basics (including caches, coherence, buses, direct memory access, etc.), and Machine Learning/Deep Learning concepts
Strong knowledge and industry expertise in GPU hardware architecture, RAS features, or both
Proficiency in developing Architecture models
Scripting and automation with Python or similar
Proficiency in C/C++
Excellent interpersonal skills and ability to collaborate with on-site and remote teams
Strong debugging and analytical skills
Self-driven and results-oriented

Preferred

Experience with resiliency and datacenter RAS
Proficiency in Verilog/System Verilog RTL simulations and debug. Ability to set up test benches and integrate various components
Programming with CUDA

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E · ARK Investment Management · SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase