Company

NVIDIA · 16 hours ago

Senior Site Reliability Engineer, Omniverse Cloud Platform

New Jersey, United States

Full-time

Remote

Senior Level, Lead/Staff

$164K/yr - $328K/yr

8+ years of experience

Artificial Intelligence (AI), GPU

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Own, innovate, and build programs, new software, and analytics that drive improvements to the availability, scalability, latency, and efficiency of Omniverse products and services

Handle upgrades and automated rollbacks across all clusters

Maintain Service Level Agreements (SLAs) with measurable benchmarks, working hand in hand with developers of new services to define SLIs and design stable, secure services

Help guide the Change Advisory Board and RCCA processes

Work with product area leads from technologies across NVIDIA to guide product engineering to build fast, reliable, and durable production systems

Apply standard methodologies and first-principles thinking to Omniverse and other strategic Cloud offerings from NVIDIA

Qualifications

System Design, Unix/Linux Systems, C++, Python, Kubernetes, Incident Management, Large Scale Coordination, HPC, PaaS, SaaS, Monitoring Stacks, Open Telemetry, Observability Stacks, Machine Learning, Model Training

Required

Bachelor's degree in Computer Science or a related field, or equivalent experience

8+ years of demonstrated competency in system design, complexity analysis, and software design on Unix/Linux systems, including performance and application issues

8+ years of validated experience authoring and debugging software written in C++ and Python

Deep hands-on experience with Kubernetes-based cloud environments

Proven experience in incident management and large-scale incident coordination

Experience working with partners across multiple teams

Background in HPC, model training operations, or related experience

Preferred

Expertise with multiple cloud service providers (CSPs)

Experience with monitoring stacks, OpenTelemetry, and sophisticated observability stacks

Background with PaaS and SaaS offerings

Experience supporting highly available, large-scale environments and their reliability

Experience in Machine Learning and Model Training

Benefits

Equity and benefits

Company

NVIDIA

Glassdoor rating: 4.6

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

Founded in 1993

Santa Clara, California, USA

10,001+ employees

https://www.nvidia.com

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)

[Chart: Distribution of Different Job Fields Receiving Sponsorship]

Trends of Total Sponsorships

2023 (735)

2022 (892)

2021 (696)

2020 (534)