Lirio · 5 hours ago

Senior System Reliability Engineer

United States

Full-time

Remote

Senior Level

$130K/yr - $150K/yr

5+ years exp

Lirio is a technology/software company that provides expertise in behavioral science, data science, and machine learning. The Senior System Reliability Engineer will be responsible for the reliability, scalability, and performance of cloud-native applications and infrastructure, leading automation, monitoring, and incident response processes while mentoring other engineers.

Artificial Intelligence (AI), Health Care, Information Technology, Machine Learning

Comp. & Benefits

No H1B

Responsibilities

Architect, implement, and maintain automated solutions for deployment, monitoring, alerting, and incident response using Lirio’s technology stack (AWS, Azure, Kubernetes, Kafka, Java, TypeScript, Groovy, Databases/SQL)

Develop and manage infrastructure as code (e.g., Terraform, AWS CloudFormation)

Build and optimize CI/CD pipelines for seamless, reliable delivery

Define, implement, and continuously refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical services

Identify and reduce operational toil through automation, platform improvements, and architectural changes

Analyze and optimize the performance of Lirio systems and services

Ensure high availability and scalability of services through proactive engineering, load testing, and capacity planning across multi-tenant and client-specific environments

Review infrastructure changes, automation scripts, and reliability-impacting code changes to ensure production readiness

Collaborate with software engineers to embed reliability, security, and operational best practices into development workflows

Partner with software engineering teams during design and architecture discussions to identify reliability risks early

Monitor system health using modern observability tools (e.g., Prometheus, Grafana, Datadog)

Participate in a defined on-call rotation supporting production systems, with clear escalation paths and expectations

Contribute to and maintain incident severity definitions, response procedures, and no-blame postmortem practices

Lead incident response, root cause analysis, and postmortems for production issues

Triage and resolve issues, ensuring minimal downtime and rapid recovery

Support client onboarding and production rollouts by ensuring reliability, observability, and operational readiness standards are met

Mentor and coach engineers on reliability engineering principles, operational ownership, and incident response best practices

Design processes to share operational knowledge and avoid single points of failure

Advise colleagues on architecture and reliability strategies

Help establish shared operational ownership across teams to reduce single points of failure and knowledge silos

Stay current with industry trends in reliability engineering, cloud operations, and automation

Bring innovation to operational practices and system design, evaluating and introducing new tools and technologies as appropriate for Lirio

Evaluate new tooling with an emphasis on operational simplicity, security, and long-term maintainability

Define and document operational processes, incident response playbooks, and reliability standards

Contribute to operational planning, incident reviews, and reliability documentation

Qualifications

Linux systems, Cloud platforms, Containers, Orchestration, CI/CD tools, Distributed systems debugging, Scripting languages, Observability tools, Agile methodologies, Version control, Databases/SQL, Load testing

Required

5-7 years of related experience

Bachelor's degree in a related field

Linux systems and networking fundamentals (DNS, TCP/IP, TLS)

Distributed systems debugging and failure analysis

Load, stress, and fault-injection testing

CI/CD tools and processes

Version control (e.g., Git)

Cloud platforms (e.g., AWS, Azure)

Containers and orchestration (Kubernetes)

Kafka (messaging/streaming)

Scripting and programming languages (e.g., Java, TypeScript, Groovy, Python)

Agile methodologies (e.g., Scrum, XP, SAFe)

Databases/SQL

Observability/monitoring tools (Datadog)

Benefits

Medical (HSA available)

Dental

Vision

Short-term & long-term disability (company-paid)

Life & AD&D (company-paid)

401(k) with company match

10 paid holidays, quarterly company closure dates, and a holiday-week company closure

Flexible time off policy

Work from home

6 weeks paid parental leave

Company

Lirio

Lirio is a behavior change AI platform that delivers behavioral engagement solutions for organizations.

Founded in 2016

Knoxville, Tennessee, USA

51-200 employees

https://lirio.com

Funding

Current Stage

Growth Stage

Total Funding

$65M

Key Investors

WR Hambrecht

2022-07-25 · Debt Financing · $3M

2022-05-12 · Series Unknown · $14.39M

2021-11-09 · Debt Financing · $1.36M

Leadership Team

Mike West

Founder, CEO & Chairman

Wade Chandler

Chief Architect

Recent News

Pulse 2.0

Lirio: Interview With Chief Behavioral Officer Amy Bucher, PhD About The Personalized Engagement Platform

2026-01-11

MedCity News

LillyDirect Taps Walmart for First Retail Pick-Up Option for Zepbound

2025-11-02

AstuteAnalytica India Pvt. Ltd.

Chronic Disease Management Market to Reach US$ 17.1 Billion by 2033 | Astute Analytica

2025-10-27

Company data provided by Crunchbase