PENNYMAC · 5 hours ago
Site Reliability Operations (SRO) Engineer III
Pennymac is a specialty financial services firm focused on the production and servicing of U.S. mortgage loans. The Site Reliability Operations Engineer III will provide 24/7 monitoring and support of the company’s IT Infrastructure, ensuring operational excellence and improving observability across the organization.
Banking · Finance · Financial Services · Lending · Mortgage
Responsibilities
Oversee 24/7 health monitoring of the company’s IT Infrastructure using tools such as AWS CloudWatch and New Relic
Own the ongoing refinement of operational alerts
Implement advanced alerting rules and thresholds to proactively identify issues, reduce noise, and ensure every alert drives action
Partner with Incident Management to identify monitoring and alerting gaps discovered during incident triage; prioritize and implement enhancements to prevent recurrence
Serve as an observability resource to application teams, assessing current instrumentation and providing actionable recommendations to improve monitoring maturity
Lead initiatives to reduce alert noise, improve signal-to-noise ratio, and ensure every alert is actionable with clear runbook linkage
Design and maintain operationally focused dashboards in New Relic that support 24/7 triage, SLA tracking, and real-time incident response
Serve as an escalation point for complex incidents
Collaborate closely with the Incident Management team, Application Developers, Internal Support Teams, and 3rd Party Vendors to ensure timely and accurate resolution of service disruptions
Perform and troubleshoot a wide range of administrative tasks across Windows and Linux environments
Assist in optimizing system performance, conducting root-cause analyses, and implementing long-term fixes
Handle more complex tasks associated with maintaining and troubleshooting the company’s virtual infrastructure
Provide guidance to junior engineers for routine issues
Tackle advanced technical issues that are escalated from Engineer I/II
Conduct deep dives into infrastructure and application logs to pinpoint underlying problems
Act as a liaison between multiple internal teams and external vendors for high-priority incidents
Ensure swift coordination and minimize downtime
Strictly follow and help refine the company’s established Change Management processes
Provide risk assessments and validation for proposed changes before approval
Monitor and respond to incoming Calls, Chats, and Emails directed to the SRO team
Offer structured feedback to stakeholders when complex issues are underway
Lead by example in managing multiple ticket queues (ServiceNow, JIRA, etc.)
Take ownership of priority tickets and oversee distribution among the team
Maintain and expand the SRO team’s knowledge base
Author new Standard Operating Procedures (SOPs) that incorporate best practices gained from resolving advanced incidents
Coordinate and execute application and website code deployments using Jenkins, GitLab, or other CI/CD tools
Help optimize deployment workflows to reduce errors and downtime
Oversee backup tasks using CommVault, AWS Backup, and related tools
Ensure data retention meets or exceeds corporate and regulatory requirements
Drive or co-lead medium to large-scale projects related to infrastructure improvements, migrations, or optimizations
Collaborate with stakeholders to define scope, timelines, and resource needs
Provide guidance to Engineer I and II staff on advanced troubleshooting methods, best practices in cloud administration, and effective incident response
Qualifications
Required
Bachelor's Degree in Computer Science or comparable experience
3-5+ years of experience working in both Windows and Linux environments, with demonstrated success in advanced troubleshooting and administration
Hands-on experience with New Relic (dashboards, NRQL queries, alerting configuration)
Demonstrated success improving monitoring coverage and alert quality
Ability to consult with application teams on observability best practices
Strong analytical skills for identifying patterns in incident data
Strong scripting or programming skills in PowerShell, Python, or a similar language; ability to automate repetitive tasks and streamline operations
Excellent organizational skills, with the ability to manage competing priorities and urgent issues in a fast-paced setting
Strong written and verbal communication skills; able to explain complex technical issues to stakeholders at various technical levels
Comfortable completing annual role-based training and certification assignments; dedicated to continual learning and development
Demonstrated ability to work independently on complex tasks and to collaborate effectively with cross-functional teams
Preferred
Advanced AWS Certifications strongly preferred
Benefits
Comprehensive Medical, Dental, and Vision
Paid Time Off Programs including vacation, holidays, illness, and parental leave
Wellness Programs, Employee Recognition Programs, and onsite gyms and cafe-style dining (select locations)
Retirement benefits, life insurance, 401k match, and tuition reimbursement
Philanthropy Programs including matching gifts, volunteer grants, charitable grants and corporate sponsorships
We value the hard work and dedication of our employees. In addition to a competitive salary, positions may offer bonus opportunities.
Company
PENNYMAC
Pennymac is a home loan lending company that offers financial services.
H1B Sponsorship
PENNYMAC has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (48)
2024 (42)
2023 (33)
2022 (44)
2021 (65)
2020 (34)
Funding
Current Stage
Public Company
Total Funding
$2.33B
2025-12-11 Post-IPO Debt · $75M
2025-08-07 Post-IPO Debt · $650M
2024-05-20 Post-IPO Debt · $850M
Company data provided by Crunchbase