
Peraton · 3 weeks ago

Senior HPC DevOps Engineer

Peraton is a next-generation national security company that drives missions of consequence spanning the globe. They are seeking a Senior HPC DevOps Engineer to own the operations and automation lifecycle for an existing HPC/AI compute cluster, working closely with team members and a Maryland-based customer.

Information Technology · Robotics
No H1B · Security Clearance Required · U.S. Citizen Only

Responsibilities

Own and manage automation workflows, including job templates, inventories, credentials, RBAC configurations, execution environments, and promotion across environments
Enforce desired state across cluster services via code-driven configuration; implement drift detection and alert on deviations; reconcile runtime state vs configured state
Build and maintain an automated node bootstrap workflow that installs/configures the OS, applies security and performance baselines, enrolls nodes into the scheduler and shared storage ecosystem, validates hardware and service readiness (CPU, network, accelerator, storage mounts), and reports pass/fail results
Implement rolling maintenance and patch automation to meet defined vulnerability response SLAs
Maintain version-controlled container build definitions and integrate image scanning into the build/release lifecycle
Ensure automation and operational workflows emit auditable logs to centralized analytics and integrate with metrics/alerting to enable reliable incident response, proactive detection, and safe auto-remediation
Automate responses to common incidents (hung nodes, storage performance alarms, image vulnerabilities, hardware failures) leveraging out-of-band hardware management interfaces and standardized runbooks
Keep runbooks and operational documentation versioned alongside automation and publish operator guidance to the organization's documentation platform

Qualifications

HPC operations · Ansible automation · Linux systems · Incident response · Container tooling · Git workflow · Technical certification · Bare-metal provisioning · Performance troubleshooting · Documentation practices

Required

12+ years of experience with a BS in computer science, IT, or a related technical field; 10 years of experience with an MS; or 8 years of experience with a Ph.D. Four additional years of experience are required in lieu of a bachelor's degree, for a total of 16 years of experience
7+ years in Linux systems / SRE / DevOps, including production cluster operations in an HPC or large-scale compute environment
3+ years of experience building and operating Ansible automation at scale (roles/collections, idempotency, inventories, secrets)
Strong Linux hardening & compliance fundamentals (SELinux/AppArmor, SSH key automation, baseline config management)
Demonstrated experience operating or automating clustered compute environments (HPC, large Linux farms, or similar)
Hands-on experience with container tooling in Linux environments, including image lifecycle/versioning
Familiarity with incident response and runbook-driven operations; ability to automate common remediations
Strong Git workflow and documentation practices
Must hold at least one active/current technical certification from the following: systems engineering (e.g., INCOSE), information security (e.g., CISSP), networking (e.g., CCNA), system administration (e.g., RHCE, MCSE), virtualization (e.g., VCP), IT systems management (e.g., ITIL), project management (e.g., PMP, Agile)
This position requires an active/current TS/SCI with Polygraph

Preferred

Bare-metal provisioning experience (PXE/iPXE, Kickstart/Preseed, Foreman/MAAS) and hardware OOB management
CI/testing for automation and promotion pipelines for playbooks
Experience with tuned performance profiles, HPC performance troubleshooting, and GPU node health validation

Benefits

Overtime
Shift differential
Discretionary bonus

Company

Peraton: Fearlessly solving the toughest national security challenges.

Funding

Current Stage
Late Stage

Leadership Team

Thomas Terjesen
Chief Information Officer
Company data provided by Crunchbase