This job has closed.

CVS Health · 3 months ago

Staff Site Reliability Engineer - Observability

AZ - Scottsdale

Full-time

Hybrid

Senior Level, Lead/Staff

$118K/yr - $261K/yr

7+ years exp

CVS Health is the nation’s leading health solutions company, dedicated to transforming health care. They are seeking a Staff Site Reliability Engineer focused on observability to lead the design and optimization of observability systems, ensuring reliability and performance across edge environments while collaborating with cross-functional teams.

Health Care, Medical, Pharmaceutical, Retail, Sales

H1B Sponsor Likely

Responsibilities

Design and implement comprehensive observability solutions tailored for edge computing environments, including monitoring, logging, tracing, and metrics collection, to provide deep visibility into system performance and health across distributed remote facilities

Define and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and business KPIs to measure and enhance system reliability in edge and centralized infrastructure

Build and optimize dashboards, visualizations, and alerting systems to enable real-time insights and rapid incident response for edge nodes and remote facilities

Implement distributed tracing and log aggregation systems to troubleshoot complex issues in edge computing environments

Collaborate with engineering teams to ensure applications and infrastructure at edge locations are designed with observability in mind, incorporating best practices for instrumentation and monitoring in resource-constrained environments

Drive proactive identification of issues in edge facilities through advanced observability tools, reducing Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) across distributed systems

Lead incident postmortems, analyzing root causes specific to edge environments and implementing observability-driven improvements to prevent recurrence

Develop and maintain tools, scripts, and automation to enhance observability pipelines, optimizing for the unique challenges of edge computing, such as bandwidth limitations and intermittent connectivity

Evaluate and integrate industry-standard observability tools (e.g., Prometheus, Grafana, ELK Stack, OpenTelemetry) and recommend solutions tailored for edge computing use cases

Optimize observability data storage, retention, and querying to balance performance, cost, and scalability across a large number of remote facilities

Mentor and guide junior SREs and engineers on observability best practices for edge computing, fostering a culture of reliability and proactive monitoring

Partner with solution, engineering, and business teams to align observability efforts with business objectives, ensuring seamless operation of edge and centralized systems

Lead cross-functional initiatives to improve observability, reliability, and operational efficiency across distributed edge infrastructure

Stay current with emerging observability trends, tools, and methodologies, particularly those suited for edge computing and distributed systems, and advocate for their adoption

Contribute to the development of observability standards, runbooks, and documentation tailored for edge environments to ensure consistency and scalability

Drive cost optimization for observability infrastructure while maintaining high-quality monitoring and alerting capabilities across remote facilities

Qualifications

Observability Engineering, Distributed Systems, Edge Computing, Prometheus, Grafana, Microservices, Kubernetes, Docker, OpenTelemetry, Python, Java, AIOps, Chaos Engineering, Cloud Certifications, Incident Management, Communication Skills, Problem-Solving Skills

Required

7+ years of experience in Site Reliability Engineering, Observability Engineering, or a related field

5+ years of experience with observability tools and platforms such as Prometheus, Grafana, Splunk, ELK, OpenTelemetry, or similar

3+ years of experience with microservices, containerized environments (e.g., Kubernetes, Docker), and distributed systems, particularly in edge deployments

Preferred

Experience with implementation of AIOps

Demonstrated ability to handle observability challenges in environments with intermittent connectivity, high latency, or geographically dispersed infrastructure

Strong proficiency in programming/scripting languages (e.g., Python, Java) for automation and tooling in distributed environments

Expertise working in edge computing environments with a large number of remote facilities, managing observability for distributed, high-latency, or resource-constrained systems

Experience with OpenTelemetry or other open-source observability frameworks optimized for edge computing

Familiarity with chaos engineering principles to validate observability systems in edge environments

Certifications in cloud platforms (Google Cloud Professional certification) or Kubernetes

Strong problem-solving skills with a proactive, analytical mindset, particularly for addressing edge computing challenges

Excellent communication and collaboration skills to work effectively with cross-functional teams across centralized and remote locations

Ability to mentor and lead technical initiatives with a focus on observability and reliability in edge environments

Comfortable working in a fast-paced, dynamic environment with a focus on delivering customer value

Knowledge of incident management processes and tools (e.g., ServiceNow, xMatters, Opsgenie) tailored for distributed systems

Deep understanding of monitoring, logging, and tracing concepts, including metrics collection, log aggregation, and distributed tracing for edge and centralized systems

Familiarity with cloud infrastructure, CI/CD pipelines, and edge-specific deployment patterns

Benefits

Affordable medical plan options

401(k) plan (including matching company contributions)

Employee stock purchase plan

No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching

Benefit solutions that address the different needs and preferences of our colleagues including paid time off, flexible work schedules, family leave, dependent care resources, colleague assistance programs, tuition assistance, retiree medical access and many other benefits depending on eligibility

Company

CVS Health

Glassdoor: 3.1

CVS Health is a health solutions company that provides integrated healthcare services to its members.

Founded in 1963

Woonsocket, Rhode Island, USA

10,001+ employees

https://www.cvshealth.com/

H1B Sponsorship

CVS Health has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship

Trends of Total Sponsorships

2022 (1)