Senior Product Manager - Observability and Resilience jobs in United States

NVIDIA · 8 hours ago

Senior Product Manager - Observability and Resilience

NVIDIA is a leading company in AI-powered applications, focusing on simplifying and delivering predictability for AI workflows. The Product Manager will lead the development of tools dedicated to ensuring the resiliency and observability of large-scale accelerated computing platforms, enabling customers to operate complex AI training and inference workloads efficiently.

Artificial Intelligence (AI), Consumer Electronics, GPU, Hardware, Software, Virtual Reality
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Be a subject‑matter expert on resiliency and observability. Deeply understand failure modes across the GPU hardware, network, and software stack, the telemetry signals that reveal them, and how those signals correlate to workload health and SLOs. Master modern reliability architectures. Keep up to date with industry trends
Build for everyone who wants to use these tools. Drive joint project planning. Define concrete milestones, tasks, and work items for resiliency and observability initiatives with external partners
Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof‑of‑concepts
Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight
Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition

Qualifications

GPU observability, AI/ML infrastructure, High-performance computing, Cloud technologies, Secure telemetry pipelines, Data-driven approach, Modern observability stacks, Cross-functional execution, Distributed systems, MLOps experience, Containerization technologies, Network architecture

Required

BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product‑management experience in enterprise technology
Experience with GPU observability (DCGM, NVML, etc.) and integration into large‑scale telemetry systems
Deep knowledge of AI/ML infrastructure, high‑performance computing (HPC), networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools
Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch
Experience building, and preferably a deep understanding of, secure, compliance‑focused telemetry pipelines (SOC 2, FedRAMP)
Ability to articulate trade‑offs among latency, throughput, cost, and reliability to both engineering and executive audiences
Data-driven approach: defines SLIs/SLOs, manages error budgets, and develops value models
Strong cross‑functional execution: writes clear specs and PRDs, produces GTM collateral, and leads agile processes

Preferred

Master's or PhD, or expertise in distributed systems, performance modeling, or fault‑tolerant computing
Experience with MLOps and LLMOps ecosystems and integrating with enterprise platforms; deployments at modern data‑center scale; delivered ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification
Startup or 0 -> 1 experience building cloud‑native observability or resilience tools; proven success bringing open‑source observability products to market and shaping GTM strategy
Familiarity with MLOps toolchains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud
Expertise with containerization technologies like Docker and Kubernetes, plus virtualization. Proficiency in network architecture and high‑performance interconnects (InfiniBand, Ethernet, RoCE)

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase