NVIDIA · 1 day ago
Senior Software Engineer, AI Resiliency
NVIDIA, a leader in AI technology, is seeking a Senior Software Engineer to lead the development of AI software resiliency for powerful AI supercomputers. The role involves implementing critical resiliency features, optimizing software for reliability, and collaborating across teams to enhance AI frameworks.
Responsibilities
Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at massive scale, such as fast checkpoint recovery, error detection, error isolation, and straggler/hang detection
Hands-On Coding & Optimization: Contribute to large-scale distributed systems with high-quality, production-level C++ and Python code. Enhance performance for AI workloads running on thousands of GPUs
Fault Tolerance & Debugging: Work on AI system error handling, implementing techniques to detect silent data corruption (SDC) and other failure scenarios. Assist in developing monitoring tools for proactive failure mitigation
Collaborate Across Teams: Work closely with senior engineers, AI researchers, and hardware/software teams to integrate resiliency features into AI frameworks like PyTorch and JAX/XLA
Testing & Automation: Develop and implement tests to ensure robustness, scalability, and efficiency of resiliency mechanisms. Contribute to CI/CD pipelines to automate validation of AI workloads
Support Production Deployments: Assist in debugging and performance-tuning large-scale AI workloads in cloud and HPC environments, ensuring seamless operation of AI training and inference workloads
Qualifications
Required
Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience
Proficiency in C++ and Python, with experience in writing efficient, high-performance code
6+ years of relevant experience
Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
Excellent problem-solving skills and ability to work in a fast-paced, highly collaborative environment
Preferred
Hands-on experience in training models or working with model training teams
Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing, especially at extreme scale
Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads
Strong systems programming skills and experience with low-level performance tuning
Benefits
Equity
Benefits
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. The information below is provided for your reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship. The highlighted segment represents job fields similar to this job.]
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 Grant · $5M
2022-08-09 Post-IPO Equity · $65M
2021-02-18 Post-IPO Equity
Recent News
Business Insider
2026-01-09
Company data provided by Crunchbase