TAG - The Aspen Group
Senior Site Reliability Engineer
TAG - The Aspen Group is one of the largest retail healthcare business support organizations in the U.S., dedicated to improving healthcare experiences. As a Senior Site Reliability Engineer, you will ensure the reliability, performance, and scalability of core systems while integrating AI and machine learning into operational workflows to solve complex reliability challenges.
Responsibilities
Design and build highly scalable and resilient systems to support our applications and services, incorporating predictive analytics to anticipate reliability risks
Develop and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using machine learning anomaly detection to ensure systems meet reliability targets
Drive improvements in system reliability, availability, and performance through proactive measures, automation, and intelligent failure prediction
Implement and manage comprehensive monitoring and alerting solutions, integrating with intelligent observability platforms that reduce alert noise and correlate events
Develop and maintain dashboards and reporting tools that turn operational data into actionable troubleshooting recommendations and performance optimizations
Evaluate and integrate advanced monitoring tools and operational intelligence platforms to enhance observability and root cause identification
Lead and participate in incident response efforts, using intelligent log analysis and automated event correlation to speed up troubleshooting and root cause identification
Develop and maintain incident management processes incorporating automated decision support systems to improve response times and minimize service disruptions
Conduct post-incident reviews, using automated pattern recognition and trend analysis to identify systemic issues and implement preventive measures
Analyze performance metrics and logs, supported by advanced observability tools, to detect bottlenecks and inefficiencies
Collaborate with development teams to implement automated profiling and optimization recommendations for code and infrastructure improvements
Perform capacity planning using machine learning forecasting models to ensure systems can handle current and future loads
Develop and implement automation solutions, including intelligent runbook automation, self-healing systems, and automated incident triage
Identify and drive process improvements by applying machine learning to operational data for continuous optimization
Maintain documentation that includes automation and machine learning guidelines for monitoring, incident management, and SRE best practices
Work closely with engineering, operations, and product teams to align reliability and monitoring goals, including automation adoption strategies
Communicate effectively with stakeholders, providing regular updates on system health, incidents, performance improvements, and data-driven insights
Foster a culture of collaboration, knowledge sharing, and automation best practices within the team and across the organization
Qualifications
Required
Bachelor's degree in computer science or a related technical field
At least 5 years of experience in Site Reliability Engineering or a similar role
Strong proficiency in at least one programming language such as Python, Go, or C#
Demonstrated experience applying machine learning and automation to operational workflows such as monitoring, alerting, and incident response
Expertise with infrastructure as code tools such as Terraform
Proven experience working with and monitoring container environments such as Cloud Run and Kubernetes
Hands-on experience working within Azure, AWS, and GCP environments (GCP preferred)
Strong understanding of networking, distributed systems, and cloud infrastructure
Familiarity with intelligent monitoring platforms and operational analytics tools such as Prometheus, Grafana, OpenSearch, Sentry, and Google Cloud Observability
Excellent problem-solving skills and the ability to work independently and as part of a team
Experience with incident management, root cause analysis, and automated operational workflows
Benefits
Paid time off
Health
Dental
Vision
401(k) savings plan with match
Company
TAG - The Aspen Group
When we launched Aspen Dental, we set out to break down the barriers that made it hard for patients to keep up with their dental health — affordability, transparency, and access.
H1B Sponsorship
TAG - The Aspen Group has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships
2024 (3)
2023 (20)
2022 (16)
2021 (14)
2020 (7)
Funding
Current Stage
Late Stage
Company data provided by Crunchbase