Databricks · 2 months ago
Senior Incident Manager
Databricks is a data and AI company that empowers data teams to tackle challenging problems through their infrastructure platform. As a Senior Incident Manager, you will lead critical production incidents, ensuring effective communication and operational resilience while collaborating with engineering teams to improve reliability.
Analytics, Artificial Intelligence (AI), Data Storage, Information Technology, Machine Learning
Responsibilities
Lead critical incidents — coordinate multi-disciplinary response efforts across Databricks’ cloud-based services to rapidly mitigate impact and restore operations
Drive technical root cause analysis and reliability improvements: collaborate with engineering teams to trace and document underlying causes across distributed systems, services, and data stores
Summarize key learnings, clearly communicate action items, and ensure that technical and procedural improvements are followed through
Own communications during incidents — deliver frequent, high-quality updates to internal stakeholders (executives, engineering leadership, support) and compose and publish customer-facing notifications that are accurate, timely, and empathetic
Mentor and train peers in both incident communication and technical response disciplines to raise the overall quality of Databricks’ incident response
Qualifications
Required
5+ years of experience in incident management, site reliability engineering, or production operations supporting large-scale, cloud-native systems
Proven ability to lead and coordinate high-severity incidents, including identifying impact, isolating fault domains, and managing multi-team response efforts
Strong understanding of cloud infrastructure (AWS, Azure, or GCP) — including compute, networking, storage, and observability components
Deep expertise in log analysis and debugging
Familiarity with log aggregation and search tools (e.g., Datadog, Elasticsearch, Splunk, Cloud Logging, or OpenTelemetry)
Hands-on experience with observability systems — metrics, logging, and tracing frameworks (Prometheus, Grafana, OpenTelemetry, etc.)
Proficiency in at least one major programming or scripting language (Python, Go, or Bash) for automating diagnostics, data collection, or analysis
Experience developing and maintaining incident playbooks and communication templates to ensure consistent, timely updates
Excellent contextual interpretation and writing skills, with the ability to summarize and communicate effectively to both technical and business audiences
BS, Master's, or other advanced degree in Computer Science, Computer Engineering, or a related engineering field
Benefits
Annual performance bonus
Equity
Company
Databricks
Databricks is a data and AI platform that unifies data engineering, analytics, and machine learning on a lakehouse architecture.
H1B Sponsorship
Databricks has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (385)
2024 (319)
2023 (227)
2022 (222)
2021 (166)
2020 (64)
Funding
Current Stage: Late Stage
Total Funding: $25.81B
Key Investors: Counterpoint Global, Franklin Templeton, Andreessen Horowitz
2025-12-16 · Series Unknown · $4B
2025-09-08 · Series Unknown · $1B
2025-01-13 · Debt Financing · $5.25B
Company data provided by Crunchbase