NVIDIA · 4 days ago
Product Manager, Health Automation and Resilience
NVIDIA is searching for a highly technical Product Manager to guide Health Automation and Resilience efforts for AI infrastructure. This role involves developing products for fault detection, automated repair workflows, and resilience tooling to improve GPU fleet performance and enhance cloud provider efficiency.
AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Responsibilities
Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets
Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components
Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention
Work with cloud providers and enterprise operators to understand failure modes and operational challenges
Develop product specifications, technical requirements, and validation criteria for both internal and open-source components
Support go-to-market activities including documentation, demos, partner enablement, and release readiness
Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy
Lead product technical reviews, customer conversations, and planning sessions
Qualifications
Required
Bachelor's degree in Computer Science, Engineering, or a similar area, or equivalent experience
8+ years of relevant experience, including a demonstrated record of leading technical products in cloud infrastructure, distributed systems, reliability engineering, or related fields
Track record of defining multi-quarter strategy and leading execution across multiple engineering teams
Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows
Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems
Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments
Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments
Experience working with open-source technologies or products for software developers
Excellent communication skills across engineering, customers, and executives
Preferred
Experience with GPU-accelerated compute, HPC systems, or large-scale AI clusters
Knowledge of Kubernetes operators, node health workflows, autoscaling, or control-plane automation
Experience with modern observability and diagnostics technologies such as Prometheus, OpenTelemetry, eBPF, or distributed tracing
Contributions to infrastructure or reliability open-source communities
Experience writing detailed build documents for software agents, distributed services, or platform-level components
Benefits
Equity
Benefits
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is provided below for
reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09: Grant, $5M
2022-08-09: Post-IPO Equity, $65M
2021-02-18: Post-IPO Equity
Company data provided by Crunchbase