
UFG Insurance · 2 days ago

Site Reliability Engineer

UFG Insurance is currently hiring for a Site Reliability Engineer who will be the senior-most engineer on the Production Management team, responsible for ensuring the reliability, performance, scalability, and efficiency of critical production systems and services. The role involves troubleshooting, operating, and enhancing distributed systems while providing guidance and support to technology teams across Business Enablement.

Financial Services, Insurance
H1B Sponsor Likely
Hiring Manager
Meghan Larsen, PHR

Responsibilities

Implement tooling to monitor system health, capacity, and performance at all levels, from hardware through the VMs and all the way to the end-user interface
Work with the production management team to troubleshoot incidents, restore service, and identify root causes
Recommend architectural and implementation changes to products delivered by development teams based on their performance in test, performance, and production environments
Support continuous improvement of ITIL processes through automation, data-driven insights, and proactive problem identification
Document and integrate SRE practices into the ITIL framework, including incident, change, and problem management workflows
Develop automation for system provisioning, monitoring, deployment, and recovery to reduce manual effort and human error
Develop and maintain comprehensive runbooks, standard operating procedures (SOPs), and knowledge base articles for recurring operational tasks and incident response actions
Collaborate with development teams to design resilient architecture and implement best practices for reliability and observability
Enhance observability by developing and maintaining dashboards, alerts, and performance analytics
Contribute to capacity planning, performance tuning, and resilience testing to ensure system health
Develop and update problem management documentation, ensuring known errors and workarounds are captured within the ITSM system
Manage incident response and participate in on-call rotations to ensure service reliability
Define, document, and track key reliability metrics (SLIs, SLOs, SLAs) and implement continuous improvement initiatives
Drive post-incident reviews (PIRs) and develop actionable insights to prevent future occurrences
Partner with security teams to ensure systems meet compliance, security, and governance standards
Evaluate and recommend new tools, technologies, and frameworks to improve operational efficiency
Monitor network systems, servers, and applications
Use all necessary tools to investigate performance and reliability of systems in testing environments. Provide detailed and specific guidance on ways to eliminate bottlenecks, improve resilience, and optimize speed and reliability
Provide mentorship and technical support to other members of Production Management

Qualifications

Site Reliability Engineering, Monitoring tools, Automation, Scripting, Networking concepts, ITIL processes, SQL Server expertise, VM performance tuning, Communication skills, Problem-solving skills, Collaboration skills

Required

Bachelor's degree in Information Technology, Computer Science, or a related field, or equivalent experience
10+ years of experience in progressively more demanding enterprise-scale technology roles
3+ years of experience as a Site Reliability Engineer or Senior DevOps Engineer
3+ years in software development, architecture, or related engineering discipline
Advanced experience with multiple enterprise monitoring and observability tools, including Dynatrace, PRTG, DTrace, SolarWinds, and similar
Complete Windows fluency mandatory; similar strengths in Linux and Unisys mainframe environments helpful
Excellent problem-solving and communication skills, with the ability to collaborate across cross-functional teams
Unparalleled understanding of advanced networking concepts and complete expertise in the entire TCP/IP stack
VM (VMware and Hyper-V) and physical compute performance and tuning, including networking and storage performance
VM (Java, Python, Browser, and similar VM environments) threading, garbage collection, and general performance
SQL Server expertise, including troubleshooting queries, indexes, and general performance
Experience with unstructured database performance
General understanding of LLM/SLM implementations and GPU implementations
Proficiency in automation and scripting languages
Good understanding of ITIL processes (Incident, Change, Problem, and Service Level Management)

Preferred

Master's or other advanced degree preferred

Benefits

Annual incentive compensation
Medical, dental, vision & life insurance
Accident, critical illness & short-term disability insurance
Retirement plans with employer contributions
Generous time-off program
Programs designed to support employee well-being and financial security

Company

UFG Insurance

The United Fire Group (UFG) companies join together to offer a range of property/casualty products.

H1B Sponsorship

UFG Insurance has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data provided by the US Department of Labor)
Trends of Total Sponsorships
2025 (1)
2024 (1)
2023 (3)
2021 (3)

Funding

Current Stage
Late Stage

Leadership Team

Kevin Leidwinger
President and Chief Executive Officer
Randy Ramlo
Chief Executive Officer
Company data provided by Crunchbase