Sage Care · 21 hours ago

AI Diagnostics & Observability Engineer

Palo Alto, CA

Full-time

Onsite

Mid, Senior Level

Sage Care is focused on improving the reliability of its AI assistant through advanced diagnostics and observability. The AI Diagnostics & Observability Engineer will build and maintain infrastructure for real-time monitoring, root cause analysis (RCA), and automated triage so that the AI operates reliably across its integration points.

Artificial Intelligence (AI), Health Care, Hospital, Medical

H1B Sponsor Likely

Responsibilities

Build automated RCA pipelines to detect and classify failure modes:

Hallucinations

Misrouted intents

Leaked/invalid tool calls (Transfer, SayMessage, Hangup, NOOP)

Unrecoverable SOP loops

Broken state transitions

Telephony dropouts / DTMF issues

Implement event tracing infrastructure capturing every agentic decision across LLM, telephony, and SOP execution

Compare expected vs. actual SOP behavior using protocol-driven expectations or human-labeled ground truth

Automatically compute performance, safety, reliability, and coverage metrics

Build live and post-call dashboards that visualize:

Full call timeline

SOP/state machine traversal

Agent reasoning traces

Tool invocation history

Divergence from expected behavior

Design interactive visualizations: heatmaps, decision-path overlays, branching comparisons, and error hotspots

Build triage dashboards for engineering and operations teams to rapidly understand system health

Voice + Telephony Integration

Trace call-level events (dropouts, retries, audio playback issues)

Detect DTMF misfires and incorrect action routing

Analyze turn segmentation, word-error-rate drift, boosting performance, and latency

Visualize errors in context (audio, transcript, aligned timecodes)

Audit intent classification accuracy and subgraph routing

Trace reasoning sequences, missing tool calls, redundant tool calls, or invalid arguments

Validate tool call correctness (maps, SMS, search, internal SOP tools)

Architect a live SOP state-machine tracer with:

Real-time transcript overlays

Current state + next expected state

Deviation alerts

Build dashboards to monitor 10–15 concurrent calls, highlighting sessions with:

Loops

Latency spikes

Failed tool calls

Repeated incorrect decisions

Provide human specialists with escalation alerts and context

Build an intervention console for on-call specialists, enabling:

“Skip step”

“Say apology”

“Escalate to human”

“Send SMS”

“Repeat last message”

Override of SOP steps while maintaining auditability and continuity

Build clustering systems (via embeddings or metadata) to group systemic failure modes:

Intent misroutes under noisy audio

Repeated missing tool calls

Looped state machine traversal

Hallucinated follow-ups or invalid summaries

Generate recurring-failure reports to guide engineering improvements

Design and implement an automated triage and notification system that:

Detects failure category and severity in real time

Routes incidents to the correct module owners:

Telephony

Transcription

LLM orchestration

SOP/decision-tree team

Platform reliability

Sends structured payloads containing:

Trace graphs

Relevant logs

Transcript segments

SOP divergence snapshots

Suggested RCA labels

Extend pipelines to automatically generate human-readable failure summaries with:

Call-level trace graphs

Tool call sequences

Transcript context

Classified failure types

Suggested root causes

Store snapshots for operational handoff and debugging

Qualifications

Python, Event-driven tracing, Diagnostics, Observability, Backend data pipelines, Frontend dashboards, Telemetry frameworks, Clustering techniques, Clinical operations, SIP, WebRTC, Twilio, Grafana, ELK, OpenTelemetry, Sentry

Required

Strong backend engineer experienced with diagnostics, observability, and event-driven tracing

Expert in Python, logging systems, real-time pipelines, and distributed debugging

Deep familiarity with LLM agents, LangGraph or state-machine frameworks, tool-calling architectures, and telemetry or tracing frameworks

Comfortable designing both backend data pipelines and frontend dashboards (React, D3, WebSockets, or equivalent)

Preferred

Telephony/Voice: SIP, WebRTC, Twilio, audio streaming pipelines

Clinical operations, call-center workflows, or mission-critical HITL supervision systems

Observability stacks (Grafana, ELK, OpenTelemetry, Sentry)

Clustering/ML techniques for failure pattern discovery

Company

Sage Care

Sage Care uses AI to automate patient triage, optimize doctor-patient matching, and improve appointment efficiency.

Palo Alto, California, USA

2-10 employees

https://www.sage.care/

H1B Sponsorship

Sage Care has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (1)

2024 (1)

Funding

Current Stage

Early Stage

Total Funding

$20M

Key Investors

Yosemite

2025-10-17: Series Unknown · $20M

2024-04-08: Seed

Recent News

Pulse 2.0

Sage Care: $20 Million Raised For AI-Based Care Navigation System

2025-10-27

thesaasnews.com

Sage Care Raises $20 Million in Funding

2025-10-20

Pulse 2.0

Sage Care: $20 Million Closed To Transform Healthcare Navigation Using AI

2025-10-20

Company data provided by Crunchbase