Staff Site Reliability Engineer - Observability jobs in United States
This job has closed.

CVS Health · 3 months ago

Staff Site Reliability Engineer - Observability

CVS Health is the nation’s leading health solutions company, dedicated to transforming health care. They are seeking a Staff Site Reliability Engineer focused on observability to lead the design and optimization of observability systems, ensuring reliability and performance across edge environments while collaborating with cross-functional teams.

Health Care, Medical, Pharmaceutical, Retail, Sales
H1B Sponsor Likely

Responsibilities

Design and implement comprehensive observability solutions tailored for edge computing environments, including monitoring, logging, tracing, and metrics collection, to provide deep visibility into system performance and health across distributed remote facilities
Define and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and business KPIs to measure and enhance system reliability in edge and centralized infrastructure
Build and optimize dashboards, visualizations, and alerting systems to enable real-time insights and rapid incident response for edge nodes and remote facilities
Implement distributed tracing and log aggregation systems to troubleshoot complex issues in edge computing environments
Collaborate with engineering teams to ensure applications and infrastructure at edge locations are designed with observability in mind, incorporating best practices for instrumentation and monitoring in resource-constrained environments
Drive proactive identification of issues in edge facilities through advanced observability tools, reducing Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) across distributed systems
Lead incident postmortems, analyzing root causes specific to edge environments and implementing observability-driven improvements to prevent recurrence
Develop and maintain tools, scripts, and automation to enhance observability pipelines, optimizing for the unique challenges of edge computing, such as bandwidth limitations and intermittent connectivity
Evaluate and integrate industry-standard observability tools (e.g., Prometheus, Grafana, ELK Stack, OpenTelemetry) and recommend solutions tailored for edge computing use cases
Optimize observability data storage, retention, and querying to balance performance, cost, and scalability across a large number of remote facilities
Mentor and guide junior SREs and engineers on observability best practices for edge computing, fostering a culture of reliability and proactive monitoring
Partner with solution, engineering, and business teams to align observability efforts with business objectives, ensuring seamless operation of edge and centralized systems
Lead cross-functional initiatives to improve observability, reliability, and operational efficiency across distributed edge infrastructure
Stay current with emerging observability trends, tools, and methodologies, particularly those suited for edge computing and distributed systems, and advocate for their adoption
Contribute to the development of observability standards, runbooks, and documentation tailored for edge environments to ensure consistency and scalability
Drive cost optimization for observability infrastructure while maintaining high-quality monitoring and alerting capabilities across remote facilities

Qualifications

Observability Engineering, Distributed Systems, Edge Computing, Prometheus, Grafana, Microservices, Kubernetes, Docker, OpenTelemetry, Python, Java, AIOps, Chaos Engineering, Cloud Certifications, Incident Management, Communication Skills, Problem-Solving Skills

Required

7+ years of experience in Site Reliability Engineering, Observability Engineering, or a related field
5+ years of experience with observability tools and platforms such as Prometheus, Grafana, Splunk, ELK, OpenTelemetry, or similar
3+ years of experience with microservices, containerized environments (e.g., Kubernetes, Docker), and distributed systems, particularly in edge deployments

Preferred

Experience with implementation of AIOps
Demonstrated ability to handle observability challenges in environments with intermittent connectivity, high latency, or geographically dispersed infrastructure
Strong proficiency in programming/scripting languages (e.g., Python, Java) for automation and tooling in distributed environments
Expertise working in edge computing environments with a large number of remote facilities, managing observability for distributed, high-latency, or resource-constrained systems
Experience with OpenTelemetry or other open-source observability frameworks optimized for edge computing
Familiarity with chaos engineering principles to validate observability systems in edge environments
Certifications in cloud platforms (Google Cloud Professional certification) or Kubernetes
Strong problem-solving skills with a proactive, analytical mindset, particularly for addressing edge computing challenges
Excellent communication and collaboration skills to work effectively with cross-functional teams across centralized and remote locations
Ability to mentor and lead technical initiatives with a focus on observability and reliability in edge environments
Comfortable working in a fast-paced, dynamic environment with a focus on delivering customer value
Knowledge of incident management processes and tools (e.g., ServiceNow, xMatters, Opsgenie) tailored for distributed systems
Deep understanding of monitoring, logging, and tracing concepts, including metrics collection, log aggregation, and distributed tracing for edge and centralized systems
Familiarity with cloud infrastructure, CI/CD pipelines, and edge-specific deployment patterns

Benefits

Affordable medical plan options
401(k) plan (including matching company contributions)
Employee stock purchase plan
No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
Benefit solutions that address the different needs and preferences of our colleagues including paid time off, flexible work schedules, family leave, dependent care resources, colleague assistance programs, tuition assistance, retiree medical access and many other benefits depending on eligibility

Company

CVS Health

CVS Health is a health solutions company that provides integrated healthcare services to its members.

H1B Sponsorship

CVS Health has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship; legend: represents job field similar to this job]
[Chart: Trends of Total Sponsorships; 2022 (1)]

Funding

Current Stage: Public Company
Total Funding: $4B
Key Investors: Michigan Economic Development Corporation, Starboard Value
2025-08-15: Post-IPO Debt · $4B
2025-07-17: Grant · $1.5M
2019-11-25: Post-IPO Equity

Leadership Team

David Joyner
President and Chief Executive Officer, CVS Health
Chandra McMahon
SVP & CISO
Company data provided by Crunchbase