Company

NVIDIA · 10 hours ago

Senior DGX Cloud Software Engineer - Infrastructure Automation and Distributed Systems

Oregon, United States

Full-time

Remote

Senior Level

$148K/yr - $276K/yr

5+ years exp

Artificial Intelligence (AI) · GPU

Growth Opportunities

H1B Sponsor Likely

Hiring Manager

Bella Yanovsky

Responsibilities

Design, build, and run the cloud infrastructure services in scope to meet our business goals, performing integrations, migrations, bring-ups, updates, and decommissions as necessary.

Participate in defining our internal-facing service level objectives and error budgets as part of our overall observability strategy.

Eliminate toil or automate it where the ROI of building and maintaining automation is worth it.

Practice sustainable, blameless incident prevention and incident response as a member of an on-call rotation.

Consult with peer teams and advise them on systems design best practices.

Qualifications

Cloud infrastructure services · Infrastructure automation · Distributed systems design · Python · Linux · Go · Perl · Ruby · Storage · Containers · Kubernetes · OpenStack · Docker · Slurm · NVIDIA Collective Communication Library · Sense of ownership

Required

BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics) or equivalent experience.

5+ years of relevant experience.

A track record showing a good balance between initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others.

Experience with infrastructure automation and distributed systems design, including developing tools for running large-scale private or public cloud systems in production.

Experience in one or more of the following: Python, Go or C++.

In-depth knowledge of one or more of the following: Linux, Slurm, Kubernetes, Networking, Storage, and Containers.

Preferred

Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

Experience working with or developing bare metal as a service (BMaaS) and associated systems, for example vending BMaaS, running Slurm on containers, or vending Kubernetes clusters.

Experience working with or developing multi-cloud infrastructure services.

Experience teaching reliability (e.g. SRE) or more general good practices for cloud systems to peers or to other companies (e.g. CRE).

Experience in running private or public cloud systems based on one or more of Kubernetes, OpenStack, Docker or Slurm.

Experience with NVIDIA Collective Communication Library (NCCL).

Experience working with a centralized security organization to prioritize and mitigate security risks.

Experience balancing build vs reuse vs buy.

Prior experience on a team of any particular name, or on an ML/AI-focused team, is not required but is a nice-to-have.

Benefits

Equity and benefits

Company

NVIDIA

Glassdoor

4.6

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

Founded in 1993

Santa Clara, California, USA

10,001+ employees

https://www.nvidia.com

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2023 (735)

2022 (892)

2021 (696)

2020 (534)