Calfus Inc. · 2 weeks ago

L2 Production Support Engineer

San Francisco Bay Area, CA

Full-time

Onsite

Mid, Senior Level

4+ years exp

Calfus is known for delivering cutting-edge AI agents and products that transform businesses. The L2 Production Support Engineer is responsible for incident triage, runbook-based remediation, and ensuring smooth incident workflows for the on-call team.

Analytics · Business Development · Information Technology · Software Engineering

Growth Opportunities

H1B Sponsor Likely

Responsibilities

Own initial triage for Sev-2/3/4 incidents and user-reported issues, including ticket classification and reproduction

Follow established runbooks to remediate common issues (service restarts, config toggles, data corrections, cache clears)

Monitor dashboards, alert streams, and on-call channels; acknowledge alerts and coordinate initial response

Participate in on-call rotation for non-Sev-1 issues and serve as secondary responder during major incidents

Provide clear, timely communication to users and stakeholders during incident resolution

Escalate complex or novel issues to L3 with excellent context: timeline, hypothesis, attempted steps, relevant logs, and metrics

Document escalations clearly for incident tracking and post-incident review

Ensure escalations include sufficient detail that L3 can pick up work without requiring clarification

Learn from L3's solutions and incorporate new findings into runbooks and knowledge base

Own and maintain the Operational Guide for the agentic on-call platform: standard procedures, troubleshooting flows, and decision trees

Create and update runbooks for recurring issues, preventive maintenance, and escalation patterns discovered through incidents

Regularly review and refine existing runbooks based on L2/L3 feedback and incident retrospectives

Test runbook accuracy quarterly and flag ambiguities or outdated instructions to the L2 team lead

Collaborate with L3 engineers to capture complex fixes as simplified runbooks for future L2 use

Maintain a knowledge base of common user issues and L2-resolvable solutions

Monitor key dashboards during shifts and validate alert accuracy (reduce false positives, tune thresholds)

Report missing or broken alerts to L3 for engineering fixes

Provide operational feedback on alerting gaps discovered during incidents

Assist in testing new alerts or monitoring rules before production deployment

Read and interpret logs, metrics, and dashboards to correlate incident signals and narrow root-cause hypotheses

Execute safe runbook-based fixes: service restarts, configuration toggles, safe data queries, and cache clears

Apply L3-provided remediation steps for known failure patterns

Document troubleshooting steps taken to build context for escalations

Participate in incident post-mortems and RCA discussions, and contribute observations from initial triage

Share learnings with the L2 team through knowledge base updates and team sync meetings

Mentor and support newer L2 engineers through pairing and code review of runbook contributions

Provide constructive feedback on operational processes and suggest improvements

When a recurring issue is identified (by L2 or L3), collaborate to create a step-by-step runbook

Ensure runbooks are clear, actionable, and safe for L2 execution without requiring L3 escalation

Include decision trees: "if X, do Y; if Z, escalate to L3"

Test runbook accuracy by walking through it with a peer before publishing

Review runbooks quarterly for accuracy and relevance; update if processes or tool names have changed

Flag outdated runbooks during team syncs (e.g., "This runbook references an old dashboard URL")

Incorporate feedback from L3 when they fix complex issues: simplify complex fixes into runbook steps for future L2 use

Maintain a single, authoritative Operational Guide covering:

Platform architecture overview (high-level, non-code)

Alert guide: what each alert means, typical causes, and first-response steps

Runbook index: list of all runbooks with quick-reference links

Troubleshooting decision tree: common symptoms and which runbook to follow for each

Escalation criteria and process

On-call procedures and communication protocols

Known issues and workarounds

Update the guide when new features deploy, alerts change, or new runbooks are created

Conduct semi-annual reviews of the guide to ensure accuracy and completeness

Maintain a searchable knowledge base (wiki, Notion, Confluence, or similar) with:

Common user issues and L2-resolvable solutions

Frequently asked questions with step-by-step answers

Post-incident summaries (non-sensitive) to share learnings

Troubleshooting checklists organized by symptom

Encourage L2 team members to contribute findings and suggest improvements

Archive or deprecate outdated entries quarterly

Incident response: Mean Time to Acknowledgment (MTTA) and Mean Time to Escalation (MTTE) for triage decisions

Runbook effectiveness: % of L2 team able to resolve tickets using runbooks without escalation; reduction in "unknown" escalations

Documentation quality: User and L3 feedback on runbook clarity and accuracy; reduced escalations due to missed troubleshooting steps

Operational guide updates: Guide reviewed and refreshed quarterly; 0 outdated procedures in active rotation

On-call reliability: response times, ticket accuracy, and team feedback on L2 availability and professionalism

Knowledge base engagement: number of contributions per quarter, search usage, and user satisfaction with knowledge base accuracy

Qualifications

Incident Triage · Operational Documentation · Monitoring Tools · Technical Troubleshooting · Linux Command Line · SQL Skills · Containerization · Proactive Learner · Communication Skills · Detail-oriented · Collaborative

Required

4–8+ years in application support, operational support, or platform operations roles

Strong dashboard reading and alert interpretation skills; ability to spot anomalies and correlate signals

Proficiency with on-call and ticketing tools: PagerDuty, Jira, ServiceNow, or similar

Familiarity with observability platforms: Prometheus, Grafana, Datadog, New Relic, or equivalent

Comfortable reading structured logs (JSON format) and using log aggregation platforms (ELK, Datadog, etc.)

Solid working knowledge of the agentic on-call platform architecture: core services, job scheduler, LLM orchestration, notification pipeline

Basic understanding of microservices: how they communicate, common failure modes, and escalation paths

Comfortable with Linux command line basics: SSH, file navigation, process inspection, basic grep/awk for log parsing

Familiarity with containerization and orchestration: Docker and Kubernetes at an operational level (restart pods, check logs, review resource usage)

Basic SQL read-only skills: able to run safe SELECT queries to validate data, check state, and gather troubleshooting context under runbook guidance

Understanding of CI/CD basics: awareness of deployment pipelines, rollback procedures, and config toggle mechanics

Exposure to LLM/agent usage patterns: understanding of tool-calling, context limits, rate limits, and vendor API quirks

Familiarity with common LLM failure modes: hallucinations, token exhaustion, timeouts, and vendor-specific rate-limiting

Ability to follow troubleshooting flows for agent-driven incidents (prompt tracing, tool execution validation, fallback behavior)

Understanding of incident classification (Sev-1/2/3/4) and appropriate escalation criteria

Knowledge of on-call best practices: communication protocols, incident documentation, and post-mortem participation

Comfortable with asynchronous and shift-based work; reliable responder with good alert acknowledgment habits

Customer-focused mindset: empathy for users and urgency in resolving their issues

Detail-oriented: accurate note-taking during incidents and meticulous runbook following

Proactive learner: ability to absorb new technologies, platforms, and troubleshooting patterns quickly

Collaborative: works well with L3 engineers, dev teams, and other operational teams

Shift-friendly: reliable availability during on-call rotations, including nights/weekends as scheduled

Humble & curious: asks clarifying questions, escalates appropriately, and doesn't hesitate to ask for help

Minimum 2–4 years in application/production support, technical support, operations, or platform engineering roles

Proven experience with incident triage, ticket management, and on-call workflows

Prior exposure to on-call systems or incident management platforms (PagerDuty, Squadcast, or custom)

Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools) is a plus

Comfortable working a shift-based on-call rotation (evenings, nights, weekends, as scheduled)

Preferred

Prior experience in a Global Capability Center or consulting firm environment

Familiarity with incident severity frameworks and SLO/SLI concepts

Exposure to multiple monitoring and observability tools

Basic scripting (Python or bash) for custom diagnostics and automation

Experience writing operational procedures or internal documentation

Benefits

Medical

Group

Parental insurance

Gratuity

Provident fund options

Employee wellness

Birthday leave

Company

Calfus Inc.

Calfus is a modern software engineering and AI services company purpose-built for the enterprise.

Founded in 2021

Mountain View, California, USA

51-200 employees

https://www.calfus.com/

H1B Sponsorship

Calfus Inc. has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (1)

2024 (5)

2023 (6)

Funding

Current Stage

Growth Stage

Leadership Team

Baljeet Chhazal

Chief Executive Officer

Rohit Agarwal

Co-Founder & Managing Director

Recent News

Newswire

Acumain and Calfus Announce Strategic Partnership

2023-12-21

Company data provided by Crunchbase