Lirio
Senior System Reliability Engineer
Lirio is a technology/software company that provides expertise in behavioral science, data science, and machine learning. The Senior System Reliability Engineer will be responsible for the reliability, scalability, and performance of cloud-native applications and infrastructure, leading automation, monitoring, and incident response processes while mentoring other engineers.
Artificial Intelligence (AI), Health Care, Information Technology, Machine Learning
Responsibilities
Architect, implement, and maintain automated solutions for deployment, monitoring, alerting, and incident response using Lirio’s technology stack (AWS, Azure, Kubernetes, Kafka, Java, TypeScript, Groovy, Databases/SQL)
Develop and manage infrastructure as code (e.g., Terraform, AWS CloudFormation)
Build and optimize CI/CD pipelines for seamless, reliable delivery
Define, implement, and continuously refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical services
Identify and reduce operational toil through automation, platform improvements, and architectural changes
Analyze and optimize the performance of Lirio systems and services
Ensure high availability and scalability of services through proactive engineering, load testing, and capacity planning across multi-tenant and client-specific environments
Review infrastructure changes, automation scripts, and reliability-impacting code changes to ensure production readiness
Collaborate with software engineers to embed reliability, security, and operational best practices into development workflows
Partner with software engineering teams during design and architecture discussions to identify reliability risks early
Monitor system health using modern observability tools (e.g., Prometheus, Grafana, Datadog)
Participate in a defined on-call rotation supporting production systems, with clear escalation paths and expectations
Contribute to and maintain incident severity definitions, response procedures, and no-blame postmortem practices
Lead incident response, root cause analysis, and postmortems for production issues
Triage and resolve issues, ensuring minimal downtime and rapid recovery
Support client onboarding and production rollouts by ensuring reliability, observability, and operational readiness standards are met
Mentor and coach engineers on reliability engineering principles, operational ownership, and incident response best practices
Design processes to share operational knowledge and avoid single points of failure
Advise colleagues on architecture and reliability strategies
Help establish shared operational ownership across teams to reduce single points of failure and knowledge silos
Stay current with industry trends in reliability engineering, cloud operations, and automation
Bring innovation to operational practices and system design, evaluating and introducing new tools and technologies as appropriate for Lirio
Evaluate new tooling with an emphasis on operational simplicity, security, and long-term maintainability
Define and document operational processes, incident response playbooks, and reliability standards
Contribute to operational planning, incident reviews, and reliability documentation
Qualifications
Required
5-7 years of related experience
Bachelor's degree in a related field
Linux systems and networking fundamentals (DNS, TCP/IP, TLS)
Distributed systems debugging and failure analysis
Load, stress, and fault-injection testing
CI/CD tools and processes
Version control (e.g., Git)
Cloud platforms (e.g., AWS, Azure)
Containers and orchestration (Kubernetes)
Kafka (messaging/streaming)
Scripting and programming languages (e.g., Java, TypeScript, Groovy, Python)
Agile methodologies (e.g., Scrum, XP, SAFe)
Databases/SQL
Observability/monitoring tools (Datadog)
Benefits
Medical (HSA available)
Dental
Vision
Short-term & long-term disability (company-paid)
Life & AD&D (company-paid)
401(k) with company match
10 paid holidays, quarterly company closure dates, and a holiday-week company closure
Flexible time off policy
Work from home
6 weeks paid parental leave
Company
Lirio
Lirio is a behavior change AI platform that delivers behavioral engagement solutions for organizations.
Funding
Current Stage: Growth Stage
Total Funding: $65M
Key Investors: WR Hambrecht
2022-07-25 · Debt Financing · $3M
2022-05-12 · Series Unknown · $14.39M
2021-11-09 · Debt Financing · $1.36M
Company data provided by Crunchbase