TensorWave · 1 day ago
Observability Engineer
TensorWave is on a mission to build seamless and resilient AI infrastructure at scale. They are seeking an Observability Engineer to own the observability stack and ensure systems are measurable, understandable, and debuggable, working closely with various teams to maintain high-quality observability practices.
AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Cloud Infrastructure, Generative AI, IaaS
Responsibilities
Own and evolve our observability and monitoring platform, with Grafana and Prometheus at its core
Design, build, and maintain high-quality metrics pipelines using Prometheus and related tooling
Create clear, actionable Grafana dashboards that tell a story rather than just display charts
Define and maintain alerts that are meaningful, actionable, and low-noise
Establish and enforce observability standards across services (metrics, logs, traces)
Partner with engineering teams to instrument applications correctly
Lead improvements to alerting strategies, SLOs, and SLIs
Support incident response by helping teams quickly understand what broke and why
Continuously evaluate and improve signal quality, cardinality, and cost
Identify observability gaps and eliminate blind spots before they become outages
Qualifications
Required
Strong hands-on experience with Grafana and Prometheus
Deep understanding of metrics-based observability
Experience designing monitoring and alerting systems at scale
Strong knowledge of alerting best practices (burn rates, SLO-based alerts, noise reduction)
Experience working with distributed systems and cloud or Kubernetes environments
Ability to reason about system behavior using telemetry
Comfortable working across teams to improve instrumentation and visibility
Preferred
Experience with OpenTelemetry
Familiarity with logs and traces (Loki, Tempo, Jaeger, etc.)
Kubernetes observability experience
Experience operating observability systems in high-scale or production-critical environments
Infrastructure-as-Code experience (Terraform, Helm, etc.)
Benefits
Competitive Salary
Stock Options
100% paid Medical, Dental, and Vision insurance
Life and Voluntary Supplemental Insurance
Short Term Disability Insurance
Flexible Spending Account
401(k)
Flexible PTO
Paid Holidays
Parental Leave
Mental Health Benefits through Spring Health
Company
TensorWave
TensorWave is a cloud provider built exclusively on AMD GPUs, supporting training and inference at scale
Funding
Current Stage: Growth Stage
Total Funding: $146.71M
Key Investors: Nexus Venture Partners, FundNV
2025-05-14: Series A, $100M
2024-10-08: Seed, $43M
2024-04-23: Seed, $0.89M
Company data provided by Crunchbase