NetSuite

Principal Site Reliability Engineer

United States

Full-time

Remote

Lead/Staff

$86K/yr - $200K/yr

8+ years exp

NetSuite, part of Oracle, is focused on product development and strategy for Oracle Health. The Principal Site Reliability Engineer will provide technical leadership for core data platforms, ensuring reliability, scalability, and operability of mission-critical systems used across multiple products and teams.

Cloud Computing, Computer, CRM, iOS, SaaS, Software

No H1B

Security Clearance Required

U.S. Citizen Only

Responsibilities

Own the end-to-end reliability, scalability, and operability of shared data platforms

Define platform standards, architectural direction, and operational guardrails

Influence cross-team technical decisions and long-term platform strategy

Drive long-term platform evolution and influence reliability strategy across the data ecosystem

Lead platform architecture and design reviews

Clearly articulate system behavior, dependencies, and failure modes

Make principled trade-offs between reliability, performance, cost, and complexity

Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively

Establish capacity models, scaling strategies, and operational best practices

Design platforms that behave predictably under load, failure, and change

Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery

Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical

Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades

Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication

Treat security as a first-class architectural concern

Design and evolve an Ansible- and Terraform-driven automation framework

Treat automation as production software: versioned, reviewed, tested, and improved

Eliminate operational toil by encoding reliability and safety into the platform

Serve as the ultimate escalation point for complex or ambiguous incidents

Focus on eliminating entire classes of failure, not just resolving individual issues

Represent SRE and platform engineering in high-visibility and sensitive forums

Communicate clearly with engineering leadership and partner teams

Qualifications

Distributed systems expertise, Hadoop ecosystem components, Infrastructure-as-Code, Automation frameworks, Linux troubleshooting, Scripting languages, Security architecture, Effective communication, Team collaboration

Required

8+ years operating large-scale, customer-facing distributed platforms

Deep experience with HDFS, YARN, HBase, Kafka, Storm, or similar systems

Strong background in Linux, networking, and distributed system troubleshooting

Infrastructure-as-Code using Ansible and Terraform

Scripting and automation using Python, Ruby, and Bash

Hands-on experience operating Kerberized environments

Proven ability to define and document technical architecture for complex systems

Demonstrated ownership of shared platforms with broad blast radius and multiple downstream consumers

Experience designing observability and capacity models for distributed platforms

U.S. Citizenship and eligibility for a Federal Security Clearance

10+ years of technical experience relevant to this position

Ability to communicate effectively and build rapport with team members

BS or MS in Computer Science, or equivalent

Benefits

Medical, dental, and vision insurance, including expert medical opinion

Short-term and long-term disability

Life insurance and AD&D

Supplemental life insurance (Employee/Spouse/Child)

Health care and dependent care Flexible Spending Accounts

Pre-tax commuter and parking benefits

401(k) Savings and Investment Plan with company match

Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.

11 paid holidays

Paid sick leave: 72 hours of paid sick leave upon date of hire, refreshed each calendar year. Unused balance carries over each year up to a maximum cap of 112 hours.

Paid parental leave

Adoption assistance

Employee Stock Purchase Plan

Financial planning and group legal

Voluntary benefits including auto, homeowner and pet insurance