Product Manager, Health Automation and Resilience jobs in United States

NVIDIA · 4 days ago

Product Manager, Health Automation and Resilience

NVIDIA is searching for a highly technical Product Manager to guide Health Automation and Resilience efforts for AI infrastructure. This role involves developing products for fault detection, automated repair workflows, and resilience tooling to improve GPU fleet performance and enhance cloud provider efficiency.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets
Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components
Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention
Work with cloud providers and enterprise operators to understand failure modes and operational challenges
Develop product specifications, technical requirements, and validation criteria for both internal and open-source components
Support go-to-market activities including documentation, demos, partner enablement, and release readiness
Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy
Lead product technical reviews, customer conversations, and planning sessions

Qualifications

Cloud infrastructure, Distributed systems, Reliability engineering, Automation systems, Telemetry systems, GPU hardware, Open-source technologies, Product requirements crafting, Technical decision making, Communication

Required

Bachelor's degree in Computer Science, Engineering, or a similar area, or equivalent experience
8+ years of relevant experience including demonstrated experience leading technical products within cloud infrastructure, distributed systems, reliability engineering, or related fields
Track record defining multi-quarter strategy and leading execution with multiple engineering teams
Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows
Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems
Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments
Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments
Experience working with open-source technologies or products for software developers
Excellent communication skills across engineering, customers, and executives

Preferred

Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters
Knowledge of Kubernetes operators, node health workflows, autoscaling, or control-plane automation
Experience with modern observability and diagnostics technologies such as Prometheus, OpenTelemetry, eBPF, or distributed tracing
Contributions to infrastructure or reliability open-source communities
Experience writing detailed build documents for software agents, distributed services, or platform-level components

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. The information below is provided for reference. (Data source: US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase