Staff Site Reliability Engineer jobs in United States

BlinkRx · 10 hours ago

Staff Site Reliability Engineer

Blink Health is the fastest-growing healthcare technology company that builds products to make prescriptions accessible and affordable to everybody. The Staff Site Reliability Engineer will establish best practices, define observability strategies, and drive initiatives to enhance system reliability and performance within the organization.
Apps, E-Commerce, Health Care, Online Portals, Pharmaceutical

Responsibilities

Establish and evolve SRE best practices across the organization, including reliability principles, error budgets, incident response, postmortems, and operational readiness standards
Define and drive observability strategy for system health, performance, and reliability, including SLIs/SLOs, alerting quality, dashboards, and service health indicators
Design and implement software-driven solutions within the infrastructure domain, automating manual processes and eliminating operational complexity and toil
Act as a technical leader and force multiplier, helping set priorities and influencing decision-making across core cloud infrastructure, reliability tooling, and platform architecture
Take ownership of large, ambiguous initiatives, driving them from concept to delivery while aligning stakeholders across engineering, security, and product
Combine deep knowledge of software development, infrastructure, and security to improve platform resilience, scalability, performance, and compliance
Proactively identify systemic risks and reliability gaps, recommending and leading platform upgrades and architectural improvements before they become incidents
Partner with engineering teams to improve developer workflows, tooling, and operational maturity, increasing productivity while reducing cognitive load
Provide technical mentorship, architecture guidance, and high-quality design and code reviews for engineers across infrastructure and product teams
Lead by example in documentation and knowledge sharing, ensuring systems and processes are well-understood and not dependent on individual ownership
Participate in and help mature incident response, escalation practices, and post-incident learning across the organization

Qualifications

Site Reliability Engineering, Cloud Platforms (AWS), Kubernetes, Python, Infrastructure as Code, Agile Environment, Technical Mentorship, Documentation

Required

Bachelor's or Master's degree in Computer Science or equivalent practical experience
7+ years of experience in site reliability engineering, infrastructure engineering, or platform engineering roles, with demonstrated impact at scale
Expert-level, methodical troubleshooting across the entire stack, from application to kernel to network
Strong command-line proficiency and deep expertise in Linux systems and operating system fundamentals
Advanced understanding of networking concepts including load balancing, proxies, DNS, TCP/IP, NAT, and service-to-service communication
Experience working across multiple languages (e.g., Python, Go, Bash) and familiarity with troubleshooting application stacks such as React or similar
Strong track record of automating repetitive and complex operational work to reduce toil and increase reliability
Ability to design and build internal tools (Python or Go) that standardize and scale engineering practices
Comfortable operating in an agile environment, with disciplined testing and quality practices
Deep experience with cloud platforms (AWS preferred, GCP/Azure acceptable), particularly managed services and production-grade architectures
Strong expertise in Kubernetes and container orchestration (EKS, Helm), including lifecycle management and operational best practices
Proven experience designing and implementing observability systems, including metrics, logging, tracing, dashboards, and alerting
Deep understanding of container technologies, security scanning, secrets management, dynamic configuration, and microservices architectures
Familiarity with service meshes and advanced traffic management concepts
Experience designing and maintaining company-wide IaC codebases using tools such as Terraform, Pulumi, CloudFormation, or Ansible
Ability to think holistically about infrastructure design, cost, reliability, security, and long-term maintainability

Company

BlinkRx is a prescription access platform that connects patients to branded medications, ensuring transparent pricing and home delivery.

Funding

Current Stage: Late Stage
Total Funding: $315M
Key Investors: 1789 Capital, SuRo Capital, 8VC
2024-11-16 · Series D · $140M
2020-10-27 · Series Unknown · $10M
2017-04-12 · Series B · $90M

Leadership Team

Geoffrey Chaiken, Founder & CEO
Matthew Chaiken, Co-Founder
Company data provided by Crunchbase