AI Diagnostics & Observability Engineer jobs in United States

Sage Care · 21 hours ago

AI Diagnostics & Observability Engineer

Sage Care is focused on improving the reliability of its AI assistant through advanced diagnostics and observability. The AI Diagnostics & Observability Engineer will build and maintain infrastructure for real-time monitoring, root cause analysis (RCA), and automated triage so that the AI operates effectively across its integration points.

Artificial Intelligence (AI) · Health Care · Hospital · Medical
H1B Sponsor Likely

Responsibilities

Build automated RCA pipelines to detect and classify failure modes:
Hallucinations
Misrouted intents
Leaked/invalid tool calls (Transfer, SayMessage, Hangup, NOOP)
Unrecoverable SOP loops
Broken state transitions
Telephony dropouts / DTMF issues
Implement event tracing infrastructure capturing every agentic decision across LLM, telephony, and SOP execution
Compare expected vs. actual SOP behavior using protocol-driven expectations or human-labeled ground truth
Automatically compute performance, safety, reliability, and coverage metrics
Build live and post-call dashboards that visualize:
Full call timeline
SOP/state machine traversal
Agent reasoning traces
Tool invocation history
Divergence from expected behavior
Design interactive visualizations: heatmaps, decision-path overlays, branching comparisons, and error hotspots
Build triage dashboards for engineering and operations teams to rapidly understand system health
Voice + Telephony Integration
Trace call-level events (dropouts, retries, audio playback issues)
Detect DTMF misfires and incorrect action routing
Analyze turn segmentation, word-error-rate drift, boosting performance, and latency
Visualize errors in context (audio, transcript, aligned timecodes)
Audit intent classification accuracy and subgraph routing
Trace reasoning sequences, missing tool calls, redundant tool calls, or invalid arguments
Validate tool call correctness (maps, SMS, search, internal SOP tools)
Architect a live SOP state-machine tracer with:
Real-time transcript overlays
Current state + next expected state
Deviation alerts
Build dashboards to monitor 10–15 concurrent calls, highlighting sessions with:
Loops
Latency spikes
Failed tool calls
Repeated incorrect decisions
Provide human specialists with escalation alerts and context
Build an intervention console for on-call specialists, enabling:
“Skip step”
“Say apology”
“Escalate to human”
“Send SMS”
“Repeat last message”
Override of SOP steps while maintaining auditability and continuity
Build clustering systems (via embeddings or metadata) to group systemic failure modes:
Intent misroutes under noisy audio
Repeated missing tool calls
Looped state machine traversal
Hallucinated follow-ups or invalid summaries
Generate recurring-failure reports to guide engineering improvements
Design and implement an automated triage and notification system that:
Detects failure category and severity in real time
Routes incidents to the correct module owners:
Telephony
Transcription
LLM orchestration
SOP/decision-tree team
Platform reliability
Sends structured payloads containing:
Trace graphs
Relevant logs
Transcript segments
SOP divergence snapshots
Suggested RCA labels
Extend pipelines to automatically generate human-readable failure summaries with:
Call-level trace graphs
Tool call sequences
Transcript context
Classified failure types
Suggested root causes
Store snapshots for operational handoff and debugging
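
As an illustration of the expected-vs-actual SOP comparison and owner routing described above, here is a minimal Python sketch. It is not Sage Care's implementation: the category keys, the `Incident` structure, the `max_visits` threshold, and the state names in the usage note are all hypothetical; only the owner team names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical category keys; owner team names are taken from the posting.
OWNERS = {
    "telephony_dropout": "Telephony",
    "transcription_error": "Transcription",
    "invalid_tool_call": "LLM orchestration",
    "sop_loop": "SOP/decision-tree team",
    "sop_divergence": "SOP/decision-tree team",
}
DEFAULT_OWNER = "Platform reliability"


@dataclass
class Incident:
    category: str  # classified failure type
    owner: str     # team the incident is routed to
    detail: str    # short human-readable summary


def find_divergence(expected: Sequence[str], actual: Sequence[str]) -> Optional[int]:
    """Return the index of the first step where the traces differ, or None."""
    for i, (exp_state, act_state) in enumerate(zip(expected, actual)):
        if exp_state != act_state:
            return i
    if len(expected) != len(actual):
        # One trace is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


def has_loop(actual: Sequence[str], max_visits: int = 3) -> bool:
    """Flag a trace that visits any single state more than max_visits times."""
    counts: dict = {}
    for state in actual:
        counts[state] = counts.get(state, 0) + 1
        if counts[state] > max_visits:
            return True
    return False


def triage(expected: Sequence[str], actual: Sequence[str]) -> Optional[Incident]:
    """Classify a call trace and pick an owner; None means no failure found."""
    if has_loop(actual):
        category, detail = "sop_loop", "state revisited too many times"
    else:
        idx = find_divergence(expected, actual)
        if idx is None:
            return None
        category, detail = "sop_divergence", f"diverged at step {idx}"
    return Incident(category, OWNERS.get(category, DEFAULT_OWNER), detail)
```

For example, `triage(["greet", "verify", "schedule"], ["greet", "schedule"])` reports a divergence at step 1 routed to the SOP/decision-tree team. A production system would presumably work from structured trace events rather than bare state names.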

Qualifications

Python · Event-driven tracing · Diagnostics · Observability · Backend data pipelines · Frontend dashboards · Telemetry frameworks · Clustering techniques · Clinical operations · SIP · WebRTC · Twilio · Grafana · ELK · OpenTelemetry · Sentry

Required

Strong backend engineer experienced with diagnostics, observability, and event-driven tracing
Expert in Python, logging systems, real-time pipelines, and distributed debugging
Deep familiarity with LLM agents, LangGraph or state-machine frameworks, tool-calling architectures, and telemetry or tracing frameworks
Comfortable designing both backend data pipelines and frontend dashboards in React, D3, WebSockets, or equivalent

Preferred

Telephony/Voice: SIP, WebRTC, Twilio, audio streaming pipelines
Clinical operations, call-center workflows, or mission-critical HITL supervision systems
Observability stacks (Grafana, ELK, OpenTelemetry, Sentry)
Clustering/ML techniques for failure pattern discovery

Company

Sage Care

Sage Care uses AI to automate patient triage, optimize doctor-patient matching, and improve appointment efficiency.

H1B Sponsorship

Sage Care has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart)
Trends of Total Sponsorships: 2025 (1), 2024 (1)

Funding

Current Stage: Early Stage
Total Funding: $20M
Key Investors: Yosemite
2025-10-17: Series Unknown · $20M
2024-04-08: Seed
Company data provided by Crunchbase