ERCOT · 21 hours ago
Site Reliability Engineer (Java focused) Sr or Lead
ERCOT is seeking a Senior or Lead Site Reliability Engineer (SRE) with strong Java application expertise to ensure the availability, performance, and reliability of mission-critical systems. This role involves managing site failover, debugging Java applications, and leading incident response efforts.
Energy
Responsibilities
Own reliability, availability, latency, and scalability of Java-based systems
Define and track SLIs, SLOs, and error budgets
Design and maintain monitoring, alerting, logging, and dashboards
Lead incident response and conduct blameless postmortems
Reduce operational toil through automation and tooling
Review system designs for reliability and failure modes
(Lead level) Establish reliability standards and mentor engineers
Debug and improve Java applications (Spring Boot preferred)
Perform JVM tuning and performance analysis
Diagnose failures across databases, messaging, and APIs
Partner with development teams to improve resilience and recovery
Participate in an on-call rotation for supported services
Focus on engineering solutions rather than repetitive manual work
Emphasis on post-incident learning and automation
Toil is tracked and actively reduced
Qualifications
Required
5+ years (Senior) or 10+ years (Lead) in SRE, DevOps, or Production Engineering
Strong Java experience (Spring-based systems)
Experience with distributed, high-availability systems
Expertise in observability tools (metrics, logs, traces)
CI/CD experience (Git, Maven, Jenkins)
Strong cross-layer debugging skills
CS or related degree required
Strong hands-on experience with observability and APM platforms such as Splunk, Dynatrace, and Datadog
Expertise in using Metrics, Logs, Traces, and Profiling (MLTP) to troubleshoot complex production incidents
Experience with the Grafana LGTM stack for observability (Loki for logs, Grafana for dashboards and visualization, Tempo for traces, and Mimir for metrics)
Experience correlating application performance data with system behavior to identify root causes and prevent recurrence
Preferred
Python
Kubernetes or OpenShift
Microsoft Azure
Kafka or ActiveMQ
Infrastructure automation (Terraform, Azure Resource Manager, Ansible, Liquibase)
Chaos or load testing experience
Company
ERCOT
The Electric Reliability Council of Texas (ERCOT) manages the flow of electric power to 24 million Texas customers.
H1B Sponsorship
ERCOT has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is provided below
for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (15)
2024 (18)
2023 (17)
2022 (30)
2021 (19)
2020 (20)
Funding
Current Stage
Late Stage
Company data provided by crunchbase