Principal Software Engineer, AIOps and Observability jobs in United States

NVIDIA · 22 hours ago

Principal Software Engineer, AIOps and Observability

NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. They are seeking a highly skilled Principal Software Engineer to design and develop AIOps & Observability platforms used by internal teams to monitor and optimize products and services. The role involves leading the technical vision, collaborating with teams, and mentoring engineers in observability and machine learning.

AI Infrastructure, Artificial Intelligence (AI), Consumer Electronics, Foundational AI, GPU, Hardware, Software, Virtual Reality
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Lead the design, development, and deployment of AIOps & Observability platforms, including metrics, logs, traces, events, alerts, dashboards, and visualizations
Drive the technical vision and roadmap for AIOps and Observability initiatives, aligning with business goals and industry best practices
Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations
Establish and implement observability standards, guidelines, and processes across NVIDIA. Research, evaluate, and adopt new observability technologies and frameworks that can enhance user experience
Provide peer reviews to other engineers including feedback on performance, scalability, security and correctness
Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events. Handle large volumes of data and ensure data quality, security, and compliance
Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads
Find opportunities to automate remediation of commonly occurring issues to operate systems reliably and efficiently

Qualifications

Observability tools, AIOps tools, Kubernetes, Programming languages, Cloud-native environment, Data pipelines, Machine learning, Mentoring, Collaboration, Problem-solving

Required

Bachelor's degree in computer science and engineering, or related field, or equivalent experience
15+ years of experience in product development and full stack engineering, with 5+ years of experience in developing and operating observability platforms and solutions, preferably in a cloud-native environment
Strong knowledge of and experience with observability tools such as Prometheus, VictoriaMetrics, Vector, Loki, Grafana, Alertmanager, ClickHouse, OpenTelemetry, etc.
Hands-on knowledge of AIOps tools such as BigPanda, PagerDuty, Datadog, etc.
Experience with Kubernetes, Nomad, Docker, and microservices architectures, as well as experience with streaming services to ingest billions of events using NATS, Kafka, etc.
Proficient in one or more programming languages, such as Go, Python, Java, C#, etc.
Passionate about observability and delivering high-quality internal platforms
Experience developing observability solutions to monitor on-prem and public cloud environments
Experience running large observability platforms on bare-metal infrastructure
Experience establishing scalable data pipelines and instrumentation for collecting, aggregating, and visualizing telemetry and operational metrics

Preferred

Deep understanding of implementing observability solutions for large-scale on-prem infrastructure and networking
Hands-on experience managing large-scale observability platforms with LLMs and ML models, and building custom services to ingest billions of metrics and logs from a wide range of assets
Experience developing a unified cloud observability platform to monitor network, compute, power, storage, operating systems, security, applications, and SaaS platforms
Demonstrated experience and expertise in using machine learning and generative AI to develop solutions such as predictive monitoring, incident diagnosis, summarization, and correlation
Demonstrated proficiency in AI/ML systems, generative AI, or agentic AI frameworks

Benefits

Equity
Benefits

Company

NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.

H1B Sponsorship

NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)

Funding

Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post IPO Equity · $65M
2021-02-18 · Post IPO Equity

Leadership Team

Jensen Huang
Founder and CEO
Michael Kagan
Chief Technology Officer
Company data provided by Crunchbase