National Student Clearinghouse · 6 hours ago

Site Reliability Engineer

Herndon, VA

Full-time

Hybrid

Mid, Senior Level

$120K/yr - $140K/yr

5+ years exp

National Student Clearinghouse is a nonprofit organization that provides essential data and services for higher education. The Site Reliability Engineer will ensure the reliability, scalability, and performance of the organization's systems and services, focusing on automation and incident management to maintain high availability.

Education

No H1B

Responsibilities

Demonstrate the Clearinghouse's core competencies: Customer Focus, Optimizes Work Processes, Collaborates, Communicates Effectively, and Be Open and Authentic

Reliability Engineering & SLOs: Define SLIs/SLOs and manage error budgets; drive reliability reviews and continuous improvement to protect customer experience

Observability & Monitoring: Build and operate end-to-end observability (metrics, traces, logs, synthetics, dashboards, alerting), leveraging tools such as Datadog; tune alerts for actionability and reduce noise

Incident Management: Participate in and help coordinate incident response and on-call rotations; lead blameless post-incident reviews, root-cause analysis, and corrective action tracking

Automation & CI/CD: Partner with engineering to automate build, test, deploy, and release processes (e.g., GitLab CI) and promote progressive delivery, change safety, and rollback strategies

Infrastructure as Code & Cloud: Provision and manage cloud infrastructure with Terraform/CloudFormation on AWS/Azure/GCP; enforce configuration baselines and drift detection

Containers & Orchestration: Operate containerized workloads at scale (Kubernetes, Helm) and implement release strategies (blue/green, canary)

Performance & Capacity: Conduct performance testing and tuning; lead capacity planning and cost-aware scaling

Security & Compliance: Embed security into pipelines and environments (e.g., IAM guardrails, policy-as-code, audit logging, vulnerability management, Wiz exposure where applicable) in partnership with DevSecOps

Runbooks & Documentation: Create and maintain runbooks, operational SOPs, and service catalogs; promote knowledge sharing and operational readiness across teams

Collaboration: Work across engineering, infrastructure, DevSecOps, security, and product to deliver reliable, scalable services; communicate clearly with technical and non-technical stakeholders

Continuous Improvement: Identify toil, propose experiments (e.g., chaos testing, game days), and automate repetitive operations to improve MTTR and deployment safety (DORA metrics awareness)

Perform other duties as required

Qualifications

Site Reliability Engineering, Cloud Platforms, CI/CD Pipelines, Infrastructure as Code, Monitoring & Observability, Incident Management, Automation Scripting, Containers & Orchestration, Performance Testing, Troubleshooting, Continuous Improvement, Security & Compliance, Collaboration

Required

Bachelor's degree in Computer Science, IT, or a related field. A combination of education and experience, including military service, will also be considered

5 years in Site Reliability Engineering, DevOps, or a related role, with demonstrated expertise in cloud platforms (AWS, Azure, or GCP), automation, and system monitoring

Operating and supporting production services in cloud environments (AWS, Azure, or GCP)

Implementing and managing CI/CD pipelines (e.g., GitLab or equivalent) and progressive delivery strategies (blue/green, canary, feature flags)

Managing containerized environments using Docker and Kubernetes

Infrastructure as Code tools such as Terraform, Ansible, or CloudFormation for automated provisioning

Automation scripting with Python, Bash, or PowerShell, including configuration baseline enforcement and drift detection

Observability and monitoring (metrics, logs, traces) and actionable alerting; hands-on experience with tools like Datadog or similar

Proven ability to lead incident management, perform root-cause analysis, and conduct blameless post-incident reviews

Cloud Certification: AWS Certified DevOps Engineer or equivalent certification (e.g., Azure DevOps Engineer Expert, Google Professional Cloud DevOps Engineer)

Cloud Platforms: Proven proficiency in deploying and managing scalable infrastructure on AWS, Azure, or GCP

Programming & Automation: Strong scripting and programming skills in Python, Bash, or Go, with experience automating operational tasks and building CI/CD pipelines

Monitoring & Observability: Hands-on experience with system health and performance monitoring tools such as Prometheus, Grafana, and the ELK stack; prior experience with Datadog is strongly preferred

CI/CD & Version Control: Expertise in Git-based workflows and CI/CD tools such as Jenkins, GitLab CI, or GitHub Actions

Incident Response: Demonstrated ability to manage on-call rotations, perform root cause analysis, and lead post-mortem processes

Troubleshooting: Skilled in diagnosing complex system issues quickly and effectively

Must live within a commutable distance to Herndon, VA or in one of the Clearinghouse's approved States for hiring purposes

Must be currently authorized to work in the United States on a full-time basis. We do not intend to sponsor external applicants for work visas, and may consider sponsorship only if no qualified candidates can be found who are authorized to work without sponsorship

Must be at least 18 years old

Benefits

Comprehensive medical, dental, and vision insurance

Life and disability insurance benefits

Health care, dependent care, and limited purpose flexible spending accounts

Health savings account with annual employer contributions of $300 for employees and $600 for employees who are enrolled with their spouse and/or dependents

Voluntary supplemental health plans for Accident and Hospital Indemnity coverage

Infertility coverage

401k matching contribution program

Competitive paid leave program consisting of vacation, sick, and personal time

Paid holidays

Up to 3 weeks of paid parental leave during a 12-month period

Up to 5 days of paid military leave per calendar year

Up to 13 days of vacation and up to 10 days of sick time per year

Up to 32 hours of accrued sick time as personal time

At least 15 paid holidays per year

Reimbursement for basic wholesale company and roadside assistance memberships

Buy back on portions of unused accrued vacation based on tenure and certain other qualifications