
TensorWave · 1 day ago

Observability Engineer

TensorWave is on a mission to build seamless and resilient AI infrastructure at scale. They are seeking an Observability Engineer to own the observability stack and ensure systems are measurable, understandable, and debuggable, working closely with various teams to maintain high-quality observability practices.

AI Infrastructure · Artificial Intelligence (AI) · Cloud Computing · Cloud Infrastructure · Generative AI · IaaS

Responsibilities

Own and evolve our observability and monitoring platform, with Grafana and Prometheus at its core
Design, build, and maintain high-quality metrics pipelines using Prometheus and related tooling
Create clear, actionable Grafana dashboards that tell a story rather than just display charts
Define and maintain alerts that are meaningful, actionable, and low-noise
Establish and enforce observability standards across services (metrics, logs, traces)
Partner with engineering teams to instrument applications correctly
Lead improvements to alerting strategies, SLOs, and SLIs
Support incident response by helping teams quickly understand what broke and why
Continuously evaluate and improve signal quality, cardinality, and cost
Identify observability gaps and eliminate blind spots before they become outages

Qualifications

Grafana · Prometheus · Metrics-based observability · Monitoring systems design · Alerting best practices · Distributed systems · Cloud environments · OpenTelemetry · Kubernetes observability · Infrastructure-as-Code

Required

Strong hands-on experience with Grafana and Prometheus
Deep understanding of metrics-based observability
Experience designing monitoring and alerting systems at scale
Strong knowledge of alerting best practices (burn rates, SLO-based alerts, noise reduction)
Experience working with distributed systems and cloud or Kubernetes environments
Ability to reason about system behavior using telemetry
Comfortable working across teams to improve instrumentation and visibility

Preferred

Experience with OpenTelemetry
Familiarity with logs and traces (Loki, Tempo, Jaeger, etc.)
Kubernetes observability experience
Experience operating observability systems in high-scale or production-critical environments
Infrastructure-as-Code experience (Terraform, Helm, etc.)

Benefits

Competitive Salary
Stock Options
100% paid Medical, Dental, and Vision insurance
Life and Voluntary Supplemental Insurance
Short Term Disability Insurance
Flexible Spending Account
401(k)
Flexible PTO
Paid Holidays
Parental Leave
Mental Health Benefits through Spring Health

Company

TensorWave

TensorWave is an AMD GPU-exclusive cloud that supports training and inference at scale.

Funding

Current Stage: Growth Stage
Total Funding: $146.71M
Key Investors: Nexus Venture Partners, FundNV

2025-05-14 · Series A · $100M
2024-10-08 · Seed · $43M
2024-04-23 · Seed · $0.89M

Leadership Team

Darrick Horton · Co-Founder / CEO
Piotr Tomasik · Co-Founder, President & COO
Company data provided by Crunchbase