NetSuite · 1 month ago

Senior Principal Site Reliability Engineer | Oracle Health Federal Operations Team

United States

Full-time

Remote

Senior Level, Lead/Staff

NetSuite is a technology leader that’s changing how the world does business. The company is seeking a Senior Principal Site Reliability Engineer to define and deploy key services, focusing on architecture, production operations, and automation to enhance the reliability and performance of Oracle Health's platform.

Cloud Computing, Computer, CRM, iOS, SaaS, Software

No H1B

Security Clearance Required

U.S. Citizen Only

Responsibilities

Own the full service lifecycle: design, implementation, deployment, on-call, and continuous improvement—maintaining high code and reliability standards

Define and meet service-level objectives (availability, latency, durability) while reducing toil through automation, observability, and self-healing mechanisms

Lead architecture, analysis, design, implementation, and production operations for Core System Framework solutions, with strong documentation and runbooks

Create and maintain clear, version-controlled documentation—architectural diagrams, SOPs, runbooks, and incident playbooks—to ensure repeatable operations, auditability, and fast onboarding

Design, write, and deploy software that improves the availability, scalability, and efficiency of platform services

Develop designs, architectures, standards, and methods for large-scale distributed systems

Build automation to prevent problem recurrence; drive real-time monitoring, alerting, and self-healing into production systems

Conduct capacity planning and demand forecasting; perform software performance analysis, system tuning, and optimization

Contribute to and support platform services across architecture, provisioning, configuration, deployment, and ongoing operations

Partner with distributed teams to prototype and launch new platform services

Stay current on emerging technologies and introduce innovations that improve reliability, security, and developer productivity

Mentor and guide engineers in distributed systems design, high-scale data processing, and operational excellence

Set and raise engineering standards across multiple teams; model best practices in reliability, security, and automation

Collaborate closely with storage, networking, observability, and security teams to deliver platform features and secure-by-default designs

Participate in an on-call rotation; lead incident response, postmortems, and follow-through on corrective actions to drive continuous improvement

Qualifications

Site Reliability Engineering, DevOps Practices, Distributed Systems Design, Automation, Performance Management, Incident Response, Capacity Planning, Mentoring, Collaboration

Required

Applicants must be able to read, write, and speak English.

Does this position require a security clearance? Yes