Oak Ridge National Laboratory · 2 weeks ago
Senior Linux HPC Storage Engineer
Oak Ridge National Laboratory (ORNL) is a U.S. Department of Energy national laboratory focused on addressing the nation’s pressing challenges. The Senior Linux HPC Storage Engineer will design, operate, and maintain large-scale HPC storage systems to support research excellence and delivery for the ORNL community.
Advanced Materials · Clean Energy · Energy · Energy Management · Manufacturing · Nuclear · Renewable Energy
Responsibilities
Architect, deploy, and manage large-scale HPC storage systems, including parallel file systems such as Lustre, GPFS/Spectrum Scale, BeeGFS and WEKA
Design, implement, and operate large-scale Ceph storage clusters for HPC and research workloads, delivering reliable, high-performance object, block, and file storage services
Ensure the availability, performance, scalability, and security of production storage environments
Administer and optimize enterprise storage platforms such as Qumulo and NetApp in support of HPC and research workloads
Design, deploy, and maintain archival storage solutions including Spectra Logic BlackPearl and large-scale tape libraries to ensure long-term data preservation and accessibility
Integrate high-performance, enterprise, and archival storage layers into cohesive tiered storage architectures that balance cost, scalability, and performance for diverse scientific workflows
Leverage automation and monitoring solutions to minimize day-to-day maintenance while identifying opportunities to optimize system performance and management
Collaborate with researchers and technical POCs to support large data workflows and optimize I/O performance for scientific workloads
Automate storage provisioning, monitoring, and maintenance using scripting and configuration management tools
Diagnose and resolve complex storage and I/O-related issues in high-throughput, low-latency HPC environments
Evaluate emerging storage technologies (NVMe, object storage, hierarchical storage management, burst buffers) and contribute to strategic planning for future HPC systems
Work with 24/7 operations staff to streamline monitoring and troubleshooting, significantly reducing the need for off-hours support
Deliver ORNL’s mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote equal opportunity by fostering a respectful workplace – in how we treat one another, work together, and measure success
Qualifications
Required
A BS degree in computer science, computer engineering, information technology, information systems, science, engineering, business, or a related discipline and a minimum of eight (8) to twelve (12) years of aligned professional experience is required for consideration. An overall combination of equivalent education and experience may be considered
Master's and PhD degree holders in the same fields of study are also encouraged to apply
Master's degree holders should have a minimum of seven (7) to ten (10) years of relevant and aligned experience
PhD holders should have a minimum of four (4) to six (6) years of relevant and aligned experience
Five (5) or more years managing UNIX/Linux systems
Demonstrated experience managing HPC storage and large-scale enterprise storage systems
Three (3) or more years working with configuration management and automation tools such as Git, Jenkins, Ansible, or Puppet
Proficiency with at least one scripting language (Bash, Python, Perl, etc.)
Strong Linux administration and advanced troubleshooting experience
Experience supporting large data systems and/or HPC scientific workloads
Strong desire to innovate and evaluate new technologies for HPC and storage environments
Collaborative approach and ability to become a trusted advisor to research teams
Preferred
Active DOE Q, DoD Top Secret, or TS/SCI clearance is strongly preferred
Solid understanding of multiple operating systems and HPC cluster technologies
Experience with Rocky/CentOS/RHEL, Ubuntu, VMware
Understanding of HPC job schedulers (SLURM) and user support workflows
Experience with container technologies in HPC environments
Experience with multiple system deployment mechanisms (Warewulf, PXEboot, Cobbler, Bright)
Experience with GPU clusters (NVIDIA, AMD) for AI/ML and scientific workloads
Deep expertise with high-performance parallel file systems (Lustre, GPFS/Spectrum Scale, BeeGFS, WEKA)
Knowledge of storage networking (InfiniBand, NVMe-oF, SAN/NAS architectures)
Familiarity with RAID, ZFS, and object storage technologies
Strong background in performance monitoring, benchmarking, and I/O optimization
Experience with monitoring systems such as Grafana, CheckMK, Nagios, Zabbix, Ganglia
Previous experience working in a government, scientific, or other highly technical environment
Strong documentation skills and ability to prepare web-based documentation
Benefits
Prescription Drug Plan
Dental Plan
Vision Plan
401(k) Retirement Plan
Contributory Pension Plan
Life Insurance
Disability Benefits
Generous Vacation and Holidays
Parental Leave
Legal Insurance with Identity Theft Protection
Employee Assistance Plan
Flexible Spending Accounts
Health Savings Accounts
Wellness Programs
Educational Assistance
Relocation Assistance
Employee Discounts
Company
Oak Ridge National Laboratory
Oak Ridge National Laboratory holds a range of R&D assignments, from fundamental nuclear physics to applied R&D on advanced energy systems.
Funding
Current Stage: Late Stage
Total Funding: $9.8M
Key Investors: US Department of Energy
2023-09-21 · Grant · $4.8M
2023-07-27 · Grant
2022-03-14 · Grant · $5M
Company data provided by crunchbase