
National Student Clearinghouse · 6 hours ago

Site Reliability Engineer

National Student Clearinghouse is a nonprofit organization that provides essential data and services for higher education. The Site Reliability Engineer will ensure the reliability, scalability, and performance of the organization's systems and services, focusing on automation and incident management to maintain high availability.

Education
No H1B

Responsibilities

Demonstrate the Clearinghouse's core competencies: Customer Focus, Optimizes Work Processes, Collaborates, Communicates Effectively, and Be Open and Authentic
Reliability Engineering & SLOs: Define SLIs/SLOs and manage error budgets; drive reliability reviews and continuous improvement to protect customer experience
Observability & Monitoring: Build and operate end-to-end observability (metrics, traces, logs, synthetics, dashboards, alerting), leveraging tools such as Datadog; tune alerts for actionability and reduce noise
Incident Management: Participate in and help coordinate incident response and on-call rotations; lead blameless post-incident reviews, root-cause analysis, and corrective action tracking
Automation & CI/CD: Partner with engineering to automate build, test, deploy, and release processes (e.g., GitLab CI) and promote progressive delivery, change safety, and rollback strategies
Infrastructure as Code & Cloud: Provision and manage cloud infrastructure with Terraform/CloudFormation on AWS/Azure/GCP; enforce configuration baselines and drift detection
Containers & Orchestration: Operate containerized workloads at scale (Kubernetes, Helm) and apply release strategies (blue/green, canary)
Performance & Capacity: Conduct performance testing and tuning; lead capacity planning and cost-aware scaling
Security & Compliance: Embed security into pipelines and environments (e.g., IAM guardrails, policy-as-code, audit logging, vulnerability management, Wiz exposure where applicable) in partnership with DevSecOps
Runbooks & Documentation: Create and maintain runbooks, operational SOPs, and service catalogs; promote knowledge sharing and operational readiness across teams
Collaboration: Work across engineering, infrastructure, DevSecOps, security, and product to deliver reliable, scalable services; communicate clearly with technical and non-technical stakeholders
Continuous Improvement: Identify toil, propose experiments (e.g., chaos testing, game days), and automate repetitive operations to improve MTTR and deployment safety (DORA metrics awareness)
Perform other duties as required

Qualifications

Site Reliability Engineering, Cloud Platforms, CI/CD Pipelines, Infrastructure as Code, Monitoring & Observability, Incident Management, Automation Scripting, Containers & Orchestration, Performance Testing, Troubleshooting, Continuous Improvement, Security & Compliance, Collaboration

Required

Bachelor's degree in Computer Science, IT, or a related field. A combination of education and experience, including military service, will also be considered
5 years of experience in Site Reliability Engineering, DevOps, or a related role, with demonstrated expertise in cloud platforms (AWS, Azure, or GCP), automation, and system monitoring
Operating and supporting production services in cloud environments (AWS, Azure, or GCP)
Implementing and managing CI/CD pipelines (e.g., GitLab or equivalent) and progressive delivery strategies (blue/green, canary, feature flags)
Managing containerized environments using Docker and Kubernetes
Infrastructure as Code tools such as Terraform, Ansible, or CloudFormation for automated provisioning
Automation scripting with Python, Bash, or PowerShell, including configuration baseline enforcement and drift detection
Observability and monitoring (metrics, logs, traces) and actionable alerting; hands-on experience with tools like Datadog or similar
Proven ability to lead incident management, perform root-cause analysis, and conduct blameless post-incident reviews
Cloud Certification: AWS Certified DevOps Engineer or equivalent certification (e.g., Azure DevOps Engineer Expert, Google Professional Cloud DevOps Engineer)
Cloud Platforms: Proven proficiency in deploying and managing scalable infrastructure on AWS, Azure, or GCP
Programming & Automation: Strong scripting and programming skills in Python, Bash, or Go, with experience automating operational tasks and building CI/CD pipelines
Monitoring & Observability: Hands-on experience with system health and performance monitoring tools such as Prometheus, Grafana, and the ELK stack; prior experience with Datadog is strongly preferred
CI/CD & Version Control: Expertise in Git-based workflows and CI/CD tools such as Jenkins, GitLab CI, or GitHub Actions
Incident Response: Demonstrated ability to manage on-call rotations, perform root cause analysis, and lead post-mortem processes
Troubleshooting: Skilled in diagnosing complex system issues quickly and effectively
Must live within commuting distance of Herndon, VA, or in one of the Clearinghouse's approved states for hiring purposes
Must be currently authorized to work in the United States on a full-time basis. We do not intend to sponsor external applicants for work visas, and may consider sponsorship only if no qualified candidates can be found who are authorized to work without sponsorship
Must be at least 18 years old

Benefits

Comprehensive medical, dental, and vision insurance
Life and disability insurance benefits
Health care, dependent care, and limited purpose flexible spending accounts
Health savings account with annual employer contributions of $300 for employees and $600 for employees who are enrolled with their spouse and/or dependents
Voluntary supplemental health plans for Accident and Hospital Indemnity coverage
Infertility coverage
401(k) matching contribution program
Competitive paid leave program consisting of vacation, sick, and personal time
Paid holidays
Up to 3 weeks of paid parental leave during a 12-month period
Up to 5 days of paid military leave per calendar year
Up to 13 days of vacation and up to 10 days of sick time per year
Up to 32 hours of accrued sick time as personal time
At least 15 paid holidays per year
Reimbursement for basic wholesale company and roadside assistance memberships
Buy back on portions of unused accrued vacation based on tenure and certain other qualifications
Employee Education Assistance Program
Enterprise-wide LinkedIn Learning subscription
Mental health support, with up to eight free therapy sessions for employees and their family members
Well-being reward benefits
Service credit towards the Public Service Loan Forgiveness program (PSLF)

Company

National Student Clearinghouse

The Clearinghouse helps educational institutions improve efficiency, reduce costs and workload, and enhance the quality of service.

Funding

Current Stage
Growth Stage

Leadership Team

Ricardo (Rick) Torres
President & CEO
Erin Seraydian
Chief Financial Officer
Company data provided by Crunchbase