This job has closed.

NVIDIA · 1 day ago

Senior Software Engineer, AI Resiliency

NVIDIA, a leader in AI technology, is seeking a Senior Software Engineer to lead the development of AI software resiliency for powerful AI supercomputers. This role involves implementing critical resiliency features, optimizing software for reliability, and collaborating across teams to enhance AI frameworks.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Growth Opportunities
H1B Sponsor Likely
Hiring Manager: Joshua Hasten

Responsibilities

Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at massive scale, such as fast checkpoint recovery, error detection, error isolation, and straggler/hang detection
Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs
Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation
Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA
Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads
Support Production Deployments: Assist in debugging and performance tuning of large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads

Qualifications

C++, Python, Distributed systems, Fault tolerance, AI frameworks, Debugging tools, Performance tuning, Problem-solving, Collaboration

Required

A Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience
Proficiency in C++ and Python, with experience in writing efficient, high-performance code
6+ years of relevant experience
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment

Preferred

Hands-on experience in training models or working with model training teams
Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale
Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads
Strong systems programming skills and experience with low-level performance tuning

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09: Grant · $5M
2022-08-09: Post-IPO Equity · $65M
2021-02-18: Post-IPO Equity

Leadership Team

Jensen Huang, Founder and CEO
Michael Kagan, Chief Technology Officer
Company data provided by Crunchbase