Sr Staff Engineer - Availability and Incident Management jobs in United States

GEICO · 4 hours ago

Sr Staff Engineer - Availability and Incident Management

GEICO is seeking an experienced Engineer with a passion for building high-performance, low-maintenance, zero-downtime platforms and applications. The Senior Staff Engineer in Availability and Incident Management will engineer solutions and empower the engineering community with automated processes, data-driven insights, and technical tools that reduce incident recurrence and improve system reliability.

Auto Insurance, Financial Services, Government, Insurance, Internet, Mobile
H1B Sponsored

Responsibilities

Lead the strategy and execution for incident retrospective and correction of error (COE) processes across the engineering organization
Help conduct deep technical root cause analysis and incident forensics across distributed systems using observability data, logs, metrics, and traces
Establish continuous improvement loops through automated trend analysis, pattern recognition algorithms, and predictive analytics
Design, code, and deploy automation platforms and self-service tools using Python, Go, Java, or C# that scale incident retrospective workflows and eliminate manual tracking
Build production-grade data pipelines, analytics systems, and real-time dashboards to measure incident trends, COE effectiveness, and action item completion rates
Write code for workflow automation, integrations with observability platforms, and APIs that connect incident management tools across the engineering ecosystem
Leverage SQL and NoSQL databases to store, query, and analyze incident data at scale using Azure tools and cloud-native services
Develop and maintain systems that ensure rigorous follow-through on action items, remediation plans, and preventive measures with automated tracking
Partner with service engineering teams to implement preventive measures and architectural improvements based on incident patterns
Present data-driven insights and incident trend analysis to leadership and engineering teams to drive preventive action
Influence and educate leadership on incident patterns, prevention strategies, and reliability best practices
Mentor engineers on coding best practices and automation techniques, and strengthen technical expertise across the engineering community
Stay current with industry advances in SRE, observability, incident management, and automation; educate teams on emerging practices

Qualifications

Automation platforms, Incident forensics, Distributed systems, Data analytics, Python, Go, Java, SQL databases, NoSQL databases, Cloud providers, CI/CD pipelines, Continuous improvement, Communication, Mentoring, Problem-solving

Required

Experience building automation platforms and self-service tools for workflow management, analytics, or engineering productivity
Fluency in at least two modern languages such as Python, Go, Java, C++, or C#, including object-oriented design
Experience building microservices architectures, REST APIs, and distributed systems
Experience with data pipelines, analytics platforms, and visualization tools for operational metrics and KPIs
Experience with SQL and NoSQL databases (e.g., PostgreSQL, MongoDB, Cassandra, CosmosDB) for data storage and analytics
Experience with observability platforms (Prometheus, Grafana, Datadog, Splunk, ELK) and distributed systems monitoring, logging, and tracing
Experience with cloud providers (Azure, AWS, or GCP) and cloud-native architectures
Experience with CI/CD pipelines, infrastructure as code, and container orchestration (Kubernetes, Docker)
Experience writing workflow automation code (YAML pipelines, GitHub Actions, Azure DevOps pipelines)
Strong understanding of distributed systems architecture, design patterns, reliability, and scaling
Knowledge of retrospective facilitation, continuous improvement processes, and blameless culture principles
Strong architecture and design skills with ability to influence engineering direction and technical roadmap
Experience solving complex analytical problems with data-driven approaches
Proven ability to partner with cross-functional engineering teams and drive systemic improvements
Excellent communication skills with ability to present technical insights to leadership and influence decision-making

Preferred

Experience leveraging GenAI or LLMs is a plus

Benefits

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being.
Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance.
Access to additional benefits like mental healthcare as well as fertility and adoption assistance.
Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year.

Company

GEICO, Government Employees Insurance Company, has been providing affordable auto insurance since 1936. It is a sub-organization of Berkshire Hathaway.

H1B Sponsorship

GEICO has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (128)
2024 (277)
2023 (338)
2022 (212)
2021 (148)
2020 (205)

Funding

Current Stage: Late Stage
Total Funding: unknown
1996-01-01: Acquired

Leadership Team

Todd Combs
Chairman, President, and Chief Executive Officer
Clayton Johnson
Sr. Director of Product Management
Company data provided by Crunchbase