NVIDIA · 1 day ago
Senior DGX Cloud Software Engineer - Infrastructure Automation and Distributed Systems
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. As part of the DGX Cloud team, the Senior DGX Cloud Software Engineer will design, build, and run cloud infrastructure services to support AI training and inference development.
Responsibilities
Design, build, and run the cloud infrastructure services in scope to meet our business goals, performing integrations, migrations, bring-ups, updates, and decommissions as necessary
Participate in the definition of our internal-facing service level objectives and error budgets as part of our overall observability strategy
Eliminate toil or automate it where the ROI of building and maintaining automation is worth it
Practice sustainable blameless incident prevention and incident response while being a member of an on-call rotation
Consult with and advise peer teams on systems design best practices
Participate in a supportive culture of values-driven introspection, communication, and self-organization
Qualifications
Required
Proficiency in one or more of the following programming languages: Python or Go
BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics) or equivalent experience
5+ years of relevant experience in infrastructure and fleet management engineering
Experience with infrastructure automation and distributed systems design, including developing tools for running large-scale private or public cloud systems that require fully automated management and are under active customer consumption in production
A track record demonstrating a mix of initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others
In-depth knowledge of one or more of the following: Linux, Slurm, Kubernetes, Local and Distributed Storage, and Systems Networking
Preferred
A demonstrated systematic problem-solving approach, coupled with clear communication skills and a willingness to take ownership and get results, such as experience driving a build / reuse / buy decision
Experience working with or developing bare metal as a service (BMaaS) and associated systems, for example vending BMaaS, running Slurm on containers, or vending Kubernetes clusters
Experience working with or developing multi-cloud infrastructure services
Experience teaching reliability engineering (e.g. SRE) and/or other scale-oriented cloud systems practices to peers and/or other companies (e.g. CRE)
Experience in running private or public cloud systems based on one or more of Kubernetes, OpenStack, Docker or Slurm
Experience with accelerated compute and communications technologies such as BlueField Networking, InfiniBand topologies, NVMesh, and/or the NVIDIA Collective Communication Library (NCCL)
Experience working with a centralized security organization to prioritize and mitigate security risks
Prior experience in an ML/AI-focused role or on a team matching specific keywords is welcome but not required
Benefits
Equity
Benefits
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is presented below for your
reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity
Company data provided by Crunchbase