NVIDIA · 12 hours ago
Senior DGX Cloud Software Engineer - Infrastructure Automation and Distributed Systems
Responsibilities
Design, build, and run in-scope cloud infrastructure services to meet our business goals, performing integrations, migrations, bring-ups, updates, and decommissions as necessary.
Participate in defining our internal-facing service level objectives and error budgets as part of our overall observability strategy.
Eliminate toil or automate it where the ROI of building and maintaining automation is worth it.
Practice sustainable, blameless incident prevention and incident response as a member of an on-call rotation.
Consult with peer teams and advise them on systems design best practices.
Qualifications
Required
BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics) or equivalent experience.
5+ years of relevant experience.
A track record showing a good balance between initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others.
Experience with infrastructure automation and distributed systems design, developing tools for running large-scale private or public cloud systems in production.
Experience in one or more of the following: Python, Go or C++.
In-depth knowledge of one or more of the following: Linux, Slurm, Kubernetes, Networking, Storage, and Containers.
Preferred
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Experience working with or developing systems associated with bare metal as a service (BMaaS), for example vending BMaaS, Slurm running on containers, or vending Kubernetes clusters.
Experience working with or developing multi-cloud infrastructure services.
Experience teaching reliability (e.g. SRE) or more general good practices for cloud systems to peers or to other companies (e.g. CRE).
Experience in running private or public cloud systems based on one or more of Kubernetes, OpenStack, Docker or Slurm.
Experience with NVIDIA Collective Communication Library (NCCL).
Experience working with a centralized security organization to prioritize and mitigate security risks.
Experience balancing build vs reuse vs buy.
Prior experience on a team of any particular name, or on an ML/AI-focused team, is not required but is a nice-to-have.
Benefits
Equity and benefits
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart)
Trends of Total Sponsorships
2023 (735)
2022 (892)
2021 (696)
2020 (534)
Funding
Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity · amount not listed
Company data provided by Crunchbase