Senior Site Reliability Engineer - Observability jobs in United States

Moderna · 3 months ago

Senior Site Reliability Engineer - Observability

Moderna is a pioneering company revolutionizing medicine through mRNA technology. They are seeking a Senior Site Reliability Engineer – Observability to lead the development of a modern observability platform, focusing on building and operating scalable solutions that enhance system reliability and performance.

Biotechnology · Genetics · Health Care · Medical · Pharmaceutical · Therapeutics
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Manage and advance Moderna’s enterprise observability platform with a focus on open-source and SaaS observability technologies (Grafana, Prometheus, Loki, Tempo, Jaeger, OpenTelemetry, Dynatrace, Splunk, etc.)
Lead governance, agent fleet management, and FinOps optimization to ensure the platform is scalable, cost-effective, and compliant with enterprise requirements
Balance hands-on engineering work (building, configuring, and operating the platform) with strategic ownership (roadmap influence, governance, cost optimization)
Collaborate with vendors and open-source communities to influence feature roadmaps and maximize platform value
Design and build highly scalable, resilient, and cost-optimized observability architectures to support application, database, host, and container monitoring
Implement telemetry pipelines for metrics, traces, and logs using Grafana, Prometheus exporters (e.g., Node, Blackbox), Kubernetes instrumentation, distributed tracing, or similar technologies
Establish and evolve best practices for monitoring, alerting, SLOs/SLIs, and incident detection across hybrid environments (cloud-native and on-prem)
Partner with application and infrastructure teams to enable self-service observability capabilities, accelerating troubleshooting and reliability improvements
Build and maintain enterprise-scale log management capabilities within the observability platform
Evolve log management to serve as a scalable, cost-effective alternative to traditional log aggregation solutions
Partner with security and infrastructure teams to ensure logging meets performance, compliance, and retention requirements
Integrate observability solutions with incident management platforms such as PagerDuty to streamline escalation, response, and workflow automation
Oversee and optimize on-call processes, ensuring alerts are actionable, routed effectively, and resolved quickly
Provide real-time telemetry during incidents and support root cause analysis (RCA) backed by observability data
Develop automation using Python, Terraform, Ansible, and CI/CD pipelines to streamline observability workflows
Implement self-healing mechanisms and automated remediation for recurring reliability issues
Ensure integrations with enterprise platforms, including PagerDuty, ServiceNow, and Jira, to enhance incident, change, and problem management
Deliver dashboards and reporting that give both engineers and leadership actionable visibility into system health, reliability, and costs
Track and report key metrics such as MTTA, MTTR, error, and cost per workload
Create documentation, runbooks, and training to support adoption and consistency across engineering teams
Participate in post-incident reviews, applying lessons learned to refine monitoring strategies and prevent recurrence
Promote a culture of continuous learning, improvement, and observability adoption across the enterprise

Qualifications

Observability platforms · Grafana · Prometheus · Python · Terraform · Ansible · Incident management · Cloud environments · Log management · Continuous improvement · Self-starter · Problem-solving

Required

7+ years of experience in site reliability engineering, observability, or platform engineering
Extensive expertise in managing and administering SaaS (Dynatrace, Splunk, or similar) or open-source observability platforms, including governance, agent fleet management, and cost optimization
Proven experience designing and building scalable, resilient, and cost-effective observability platforms using Prometheus, Grafana, Node/Blackbox Exporters, Kubernetes, or similar
Strong knowledge of observability practices (metrics, logs, traces, SLO/SLI design) across complex, large-scale enterprise environments
Hands-on experience with incident management platforms such as PagerDuty and ITSM integrations (ServiceNow, Jira)
Proficiency in automation and infrastructure-as-code (Python, Terraform, Ansible, Bash)
Experience monitoring and troubleshooting hybrid and cloud-native environments (AWS, Azure, or GCP)
Strong problem-solving skills and the ability to operate in a fast-paced, global environment
Demonstrated ability to take initiative, work independently, and drive outcomes in complex enterprise environments

Preferred

Experience working in biotech, pharmaceutical, healthcare, or other regulated environments (e.g., GxP, HIPAA)
Experience with enterprise-scale log management (e.g., Loki, Elastic, Splunk) and retention/cost optimization
Familiarity with ITSM processes and integrations with observability solutions
Relevant certifications in AWS, Azure, Dynatrace, Splunk, or related observability technologies
A proactive, innovative mindset with a passion for open-source adoption, continuous improvement, and automation

Benefits

Quality healthcare and insurance benefits
Lifestyle Spending Accounts to create your own pathway to well-being
Free premium access to fitness, nutrition, and mindfulness classes
Family planning and adoption benefits
Generous paid time off, including vacation, bank holidays, volunteer days, sabbatical, global recharge days, and a discretionary year-end shutdown
Savings and investments
Location-specific perks and extras!

Company

Moderna Therapeutics is a biotechnology company that specializes in vaccines and drug development.

H1B Sponsorship

Moderna has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data Powered by US Department of Labor)
Trends of Total Sponsorships
2025 (42)
2024 (49)
2023 (55)
2022 (23)
2021 (58)
2020 (19)

Funding

Current Stage: Public Company
Total Funding: $4.56B
Key Investors: Coalition for Epidemic Preparedness Innovations, Ares Management, U.S. Department of Health & Human Services
2025-12-18 · Grant · $54.3M
2025-11-20 · Post-IPO Debt · $600M
2024-07-02 · Grant · $176M

Leadership Team

Stephane Bancel · Founding CEO
Kenneth Chien · Co-Founder
Company data provided by Crunchbase