Oracle · 5 hours ago
Principal Site Reliability Engineer
Oracle, a world leader in cloud solutions, is seeking a Principal Site Reliability Engineer to provide technical leadership for the core data platforms behind Oracle Health’s Data & Analytics Platform. The role involves owning the reliability, scalability, and operability of shared data platforms while leading the design and operation of large-scale distributed systems.
Data Governance, Data Management, Enterprise Software, Information Technology, SaaS, Software
Responsibilities
Own the end-to-end reliability, scalability, and operability of shared data platforms
Define platform standards, architectural direction, and operational guardrails
Influence cross-team technical decisions and long-term platform strategy
Drive long-term platform evolution and influence reliability strategy across the data ecosystem
Lead platform architecture and design reviews
Clearly articulate system behavior, dependencies, and failure modes
Make principled trade-offs between reliability, performance, cost, and complexity
Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively
Establish capacity models, scaling strategies, and operational best practices
Design platforms that behave predictably under load, failure, and change
Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery
Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades
Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication
Treat security as a first-class architectural concern
Design and evolve an Ansible- and Terraform-driven automation framework
Treat automation as production software: versioned, reviewed, tested, and improved
Eliminate operational toil by encoding reliability and safety into the platform
Serve as the ultimate escalation point for complex or ambiguous incidents
Focus on eliminating entire classes of failure, not just resolving individual issues
Represent SRE and platform engineering in high-visibility and sensitive forums
Communicate clearly with engineering leadership and partner teams
Qualifications
Required
8+ years operating large-scale, customer-facing distributed platforms
Deep experience with HDFS, YARN, HBase, Kafka, Storm, or similar systems
Strong background in Linux, networking, and distributed system troubleshooting
Infrastructure-as-Code using Ansible and Terraform
Scripting and automation using Python, Ruby, and Bash
Hands-on experience operating Kerberized environments
Proven ability to define and document technical architecture for complex systems
Demonstrated ownership of shared platforms with broad blast radius and multiple downstream consumers
Experience designing observability and capacity models for distributed platforms
U.S. Citizenship and eligibility for a Federal Security Clearance
10+ years of technical experience relevant to this position
Ability to communicate effectively and build rapport with team members
BS or MS in Computer Science, or equivalent
Benefits
Flexible medical
Life insurance
Retirement options
Volunteer programs
Company
Oracle
Oracle is an integrated cloud application and platform services company that sells a range of enterprise information technology solutions.
Funding
Current Stage: Public Company
Total Funding: $25.75B
Key Investors: Sequoia Capital
2025-09-24 · Post-IPO Debt · $18B
2025-02-03 · Post-IPO Debt · $7.75B
1986-03-12 · IPO
Company data provided by Crunchbase