NVIDIA · 22 hours ago
Principal Software Engineer, AIOps and Observability
NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. They are seeking a highly skilled Principal Software Engineer to design and develop AIOps & Observability platforms used by internal teams to monitor and optimize products and services. The role involves leading the technical vision, collaborating with teams, and mentoring engineers in observability and machine learning.
AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Responsibilities
Lead the design, development, and deployment of AIOps & Observability platforms, including metrics, logs, traces, events, alerts, dashboards, and visualizations
Drive the technical vision and roadmap for AIOps and Observability initiatives, aligning with business goals and industry best practices
Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations
Establish and implement observability standards, guidelines, and processes across NVIDIA. Research, evaluate, and adopt new observability technologies and frameworks that can enhance user experience
Provide peer reviews to other engineers including feedback on performance, scalability, security and correctness
Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events. Handle large volumes of data and ensure data quality, security, and compliance
Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads
Find opportunities to automate remediation of commonly occurring issues to operate systems reliably and efficiently
Qualifications
Required
Bachelor's degree in computer science and engineering, or related field, or equivalent experience
15+ years of experience in product development and full stack engineering, with 5+ years of experience in developing and operating observability platforms and solutions, preferably in a cloud-native environment
Strong knowledge of and experience with observability tools such as Prometheus, VictoriaMetrics, Vector, Loki, Grafana, Alertmanager, ClickHouse, OpenTelemetry, etc
Hands-on knowledge in AIOps tools such as BigPanda, PagerDuty, Datadog, etc
Experience with Kubernetes, Nomad, Docker, and microservices architectures as well as experience with streaming services to ingest billions of events using NATS, Kafka, etc
Proficient in one or more programming languages, such as Go, Python, Java, C#, etc
Passionate about observability and delivering high-quality internal platforms
Experience developing observability solutions to monitor on-prem and public cloud environments
Experience running large observability platforms on bare-metal infrastructure
Experience establishing scalable data pipelines and instrumentation for collecting, aggregating, and visualizing telemetry and operational metrics
Preferred
Deep understanding of implementing observability solutions for large-scale on-prem infrastructure and networking
Hands-on experience managing large-scale observability platforms with LLMs and ML models, and building custom services to ingest billions of metrics and logs from a wide range of assets
Experience developing a unified cloud observability platform to monitor network, compute, power, storage, operating systems, security, applications, and SaaS platforms
Demonstrated experience and expertise in using machine learning and generative AI to develop solutions such as predictive monitoring, incident diagnosis, summarization, and correlation
Demonstrated proficiency in AI/ML systems, generative AI, or agentic AI frameworks
Benefits
Equity
Benefits
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity
Company data provided by Crunchbase