Calfus Inc. · 3 weeks ago

L3 Production Support Engineer

San Francisco Bay Area, CA

Full-time

Onsite

Senior Level

6+ years of experience

Calfus is known for delivering cutting-edge AI agents and products that transform businesses. The L3 Production Support Engineer is responsible for managing complex production incidents and implementing improvements for the agentic on-call management platform, ensuring reliability at scale.

Analytics · Business Development · Information Technology · Software Engineering

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Own Sev-1/Sev-2 incident response as incident commander or lead resolver, driving swift diagnosis and resolution

Lead post-incident RCAs, identifying systemic issues and driving long-term fixes across backend, infrastructure, and UI

Establish and refine incident response playbooks, runbooks, and escalation procedures

Participate in on-call rotation as primary/secondary responder with accountability for critical systems

Perform deep production troubleshooting: log analysis, distributed tracing, metric correlation, and profiling under pressure

Diagnose and fix complex issues across microservices: scheduling engine, LLM orchestration, notification pipeline, and integrations

Optimize database queries, identify locking issues, and manage migrations in PostgreSQL under production constraints

Architect and implement Redis caching, rate limiting, and queue-based patterns for reliability and scale

Work with Kubernetes, container orchestration, and deployment pipelines; manage rollbacks and feature toggles during incidents

Resolve end-to-end incidents regardless of origin (backend API, database, LLM vendor, or React frontend)

Debug and ship targeted React fixes when UI is the fastest path to incident resolution

Drive code-level improvements in backend services (Python/FastAPI) to harden agent flows, retry logic, and error handling

Collaborate closely with dev teams on defects, performance bottlenecks, and architecture-level changes

Design and tune monitoring, alerting, and SLO/SLI frameworks for the platform

Maintain and evolve critical runbooks, playbooks, and knowledge base entries as patterns emerge

Mentor L2 engineers on deep troubleshooting, escalation discipline, and incident best practices

Drive blameless post-mortems and systemic risk reduction across the platform

Qualifications

Python/FastAPI · PostgreSQL · Kubernetes · React · Redis · CI/CD Pipelines · Async APIs · Incident Management · Observability · Shell Scripting · Technical Depth · Mentorship Mindset · Documentation Discipline · Cross-Functional Collaboration

Required

5–8+ years in backend engineering with strong hands-on experience in Python/FastAPI or equivalent

Deep knowledge of async APIs, background jobs, message queues (Celery, RabbitMQ, or similar), and distributed scheduling

Production-grade database skills: PostgreSQL query optimization, locking, migrations, and performance tuning

Redis expertise: caching patterns, rate limiting, streams, and pub/sub for real-time systems

Strong observability and on-call mindset: designing alerts, understanding SLOs/SLIs, error budgets, and Sev definitions

Proficiency with Kubernetes, Docker, container orchestration, and CI/CD pipelines (Jenkins, Bitbucket, GitHub Actions)

Understanding of cloud infrastructure (Azure preferred) and networking fundamentals

Solid grasp of LLM orchestration concepts: prompt engineering, tool-calling, context windows, rate limits, and vendor-specific behavior

Experience with LLM failure modes: hallucinations, token limits, timeout patterns, and cost/latency tradeoffs

Knowledge of agent frameworks (LangGraph or similar) and how they compose across microservices

Ability to debug LLM-driven flows: tracing prompts, understanding retry/backoff behavior, and validating tool outputs

2–3+ years hands-on with React and TypeScript in production environments

Competency reading and modifying existing React code: components, hooks, routing, state management (Redux/Context)

Browser debugging skills: DevTools, React DevTools, network throttling, and performance profiling

Ability to implement targeted UI fixes: form validation, error handling, API error display, and minor UX hardening

Familiarity with frontend build pipelines: Webpack/Vite, environment configs, feature flags, and deployment strategies

Expert-level log parsing and correlation across services using structured logging (JSON, correlation IDs)

Proficiency with observability platforms (Prometheus, Grafana, Datadog, New Relic, or similar)

Ability to construct and execute production queries under incident time pressure

Strong scripting skills (bash/Python) for diagnostics, automation, and custom monitoring

Incident command maturity: composure under pressure, clear communication, and decisive action during critical outages

Technical depth with breadth: deep backend knowledge + sufficient full-stack awareness to own end-to-end incidents

Mentorship mindset: capable of raising L2 engineers through code review, pairing, and RCA participation

Documentation discipline: ability to capture runbooks, architecture decisions, and lessons learned clearly

Cross-functional collaboration: working effectively with dev, SRE, platform, and business teams during incidents

Minimum 6–10 years in backend/platform/SRE roles with at least 3 years in production support, incident response, or on-call engineering

Proven track record leading Sev-1/Sev-2 incidents in distributed, multi-service systems

Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools)

Comfortable with continuous on-call rotation and on-demand availability for critical incidents

Preferred

Experience with on-call/incident management platforms (PagerDuty, Squadcast, Opsgenie, or custom solutions)

Familiarity with RBAC, SSO, and authentication/authorization patterns

Knowledge of RAG (Retrieval Augmented Generation) systems

Company

Calfus Inc.

Calfus is a modern software engineering and AI services company purpose-built for the enterprise.

Founded in 2021

Mountain View, California, USA

51-200 employees

https://www.calfus.com/

H1B Sponsorship

Calfus Inc. has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)

Trends of Total Sponsorships

2025 (1)

2024 (5)

2023 (6)

Funding

Current Stage

Growth Stage

Leadership Team

Baljeet Chhazal

Chief Executive Officer

Rohit Agarwal

Co-Founder & Managing Director

Recent News

Newswire

Acumain and Calfus Announce Strategic Partnership

2023-12-21

Company data provided by Crunchbase