
NetSuite · 4 hours ago

Principal Site Reliability Engineer

NetSuite, part of Oracle, is focused on product development and strategy for Oracle Health. The Principal Site Reliability Engineer will provide technical leadership for core data platforms, ensuring reliability, scalability, and operability of mission-critical systems used across multiple products and teams.

Cloud Computing, Computer, CRM, iOS, SaaS, Software
No H1B, Security Clearance Required, U.S. Citizen Only

Responsibilities

Own the end-to-end reliability, scalability, and operability of shared data platforms
Define platform standards, architectural direction, and operational guardrails
Influence cross-team technical decisions and long-term platform strategy
Drive long-term platform evolution and influence reliability strategy across the data ecosystem
Lead platform architecture and design reviews
Clearly articulate system behavior, dependencies, and failure modes
Make principled trade-offs between reliability, performance, cost, and complexity
Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively
Establish capacity models, scaling strategies, and operational best practices
Design platforms that behave predictably under load, failure, and change
Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery
Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades
Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication
Treat security as a first-class architectural concern
Design and evolve an Ansible- and Terraform-driven automation framework
Treat automation as production software: versioned, reviewed, tested, and improved
Eliminate operational toil by encoding reliability and safety into the platform
Serve as the ultimate escalation point for complex or ambiguous incidents
Focus on eliminating entire classes of failure, not just resolving individual issues
Represent SRE and platform engineering in high-visibility and sensitive forums
Communicate clearly with engineering leadership and partner teams

Qualifications

Distributed systems expertise, Hadoop ecosystem components, Infrastructure-as-Code, Automation frameworks, Linux troubleshooting, Scripting languages, Security architecture, Effective communication, Team collaboration

Required

8+ years operating large-scale, customer-facing distributed platforms
Deep experience with HDFS, YARN, HBase, Kafka, Storm, or similar systems
Strong background in Linux, networking, and distributed system troubleshooting
Infrastructure-as-Code using Ansible and Terraform
Scripting and automation using Python, Ruby, and Bash
Hands-on experience operating Kerberized environments
Proven ability to define and document technical architecture for complex systems
Demonstrated ownership of shared platforms with broad blast radius and multiple downstream consumers
Experience designing observability and capacity models for distributed platforms
U.S. Citizenship and eligibility for a Federal Security Clearance
10+ years of technical experience relevant to this position
Ability to communicate effectively and build rapport with team members
BS or MS in Computer Science, or equivalent

Benefits

Medical, dental, and vision insurance, including expert medical opinion
Short-term disability and long-term disability
Life insurance and AD&D
Supplemental life insurance (Employee/Spouse/Child)
Health care and dependent care Flexible Spending Accounts
Pre-tax commuter and parking benefits
401(k) Savings and Investment Plan with company match
Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
11 paid holidays
Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
Paid parental leave
Adoption assistance
Employee Stock Purchase Plan
Financial planning and group legal
Voluntary benefits including auto, homeowner and pet insurance

Company

NetSuite

NetSuite is a cloud computing company dedicated to delivering business applications over the internet.

Funding

Current Stage
Public Company
Total Funding
$157.79M
Key Investors
Meritech Capital Partners, Tako Ventures, StarVest Partners
2016-07-28: Acquired
2007-12-20: IPO
2007-02-05: Secondary Market · $17.87M

Leadership Team

Brian Chess
SVP Technology and AI
Eli Johnson
Vice President, Global Sales Productivity
Company data provided by Crunchbase