Cadre5 · 1 week ago

Senior Linux HPC Storage Engineer

Knoxville, TN

Full-time

Hybrid

Senior Level, Lead/Staff

8+ years exp

Cadre5 provides innovative technical solutions in partnership with the Information Technology Services Directorate at Oak Ridge National Laboratory. The company is seeking a Senior Linux HPC Storage Engineer to design, operate, and maintain large-scale HPC storage systems and to ensure the performance and security of production storage environments.

Computer Software

Growth Opportunities

No H1B

Security Clearance Required

U.S. Citizen Only

Responsibilities

Architect, deploy, and manage large-scale HPC storage systems, including parallel file systems such as Lustre, GPFS/Spectrum Scale, BeeGFS, and WEKA

Design, implement, and operate large-scale Ceph storage clusters for HPC and research workloads, delivering reliable, high-performance object, block, and file storage services

Ensure the availability, performance, scalability, and security of production storage environments

Administer and optimize enterprise storage platforms such as Qumulo and NetApp in support of HPC and research workloads

Design, deploy, and maintain archival storage solutions including Spectra Logic BlackPearl and large-scale tape libraries to ensure long-term data preservation and accessibility

Integrate high-performance, enterprise, and archival storage layers into cohesive tiered storage architectures that balance cost, scalability, and performance for diverse scientific workflows

Leverage automation and monitoring solutions to minimize day-to-day maintenance while identifying opportunities to optimize system performance and management

Collaborate with researchers and technical POCs to support large data workflows and optimize I/O performance for scientific workloads

Automate storage provisioning, monitoring, and maintenance using scripting and configuration management tools

Diagnose and resolve complex storage and I/O-related issues in high-throughput, low-latency HPC environments

Evaluate emerging storage technologies (NVMe, object storage, hierarchical storage management, burst buffers) and contribute to strategic planning for future HPC systems

Work with 24/7 operations staff to streamline monitoring and troubleshooting, significantly reducing the need for off-hours support

Deliver ORNL’s mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote equal opportunity by fostering a respectful workplace

Qualifications

HPC storage management, Linux administration, Configuration management, Scripting languages, Parallel file systems, Storage networking, Performance monitoring, Collaboration, Documentation skills

Required

A BS degree in computer science, computer engineering, information technology, information systems, science, engineering, or related discipline and 8–12 years of relevant professional experience; or an equivalent combination of education and experience

Master's degree holders: 7–10 years of relevant experience

PhD holders: 4–6 years of relevant experience

Five (5) or more years managing UNIX/Linux systems

Demonstrated experience managing HPC storage and large-scale enterprise storage systems

Three (3) or more years working with configuration management and automation tools such as Git, Jenkins, Ansible, or Puppet

Proficiency with at least one scripting language (Bash, Python, Perl, etc.)

Strong Linux administration and advanced troubleshooting experience

Experience supporting large data systems and/or HPC scientific workloads

Strong desire to innovate and evaluate new technologies for HPC and storage environments

Collaborative approach and ability to become a trusted advisor to research teams

The ability to obtain and maintain a Department of Energy 'Q' clearance is required. This requires U.S. citizenship

Preferred

Active DOE Q, DoD Top Secret, or TS/SCI clearance is strongly preferred

Solid understanding of multiple operating systems and HPC cluster technologies

Experience with Rocky/CentOS/RHEL, Ubuntu, VMware

Understanding of HPC job schedulers (Slurm) and user support workflows

Experience with container technologies in HPC environments

Experience with multiple system deployment mechanisms (Warewulf, PXE boot, Cobbler, Bright)

Experience with GPU clusters (NVIDIA, AMD) for AI/ML and scientific workloads

Deep expertise with high-performance parallel file systems (Lustre, GPFS/Spectrum Scale, BeeGFS, WEKA)

Knowledge of storage networking (InfiniBand, NVMe-oF, SAN/NAS architectures)

Familiarity with RAID, ZFS, and object storage technologies

Strong background in performance monitoring, benchmarking, and I/O optimization

Experience with monitoring systems such as Grafana, Checkmk, Nagios, Zabbix, Ganglia

Previous experience working in a government, scientific, or other highly technical environment

Strong documentation skills and ability to prepare web-based documentation

Benefits

3 weeks’ vacation

Excellent medical insurance, including employer-paid benefits

Full medical, dental, and vision coverage coupled with 401(k) match

15 days PTO

10 holidays

Company

Cadre5

Cadre5 is dedicated to building great software.

Founded in 1999

Knoxville, Tennessee, USA

51-200 employees

https://www.cadre5.com/#featured

Funding

Current Stage

Growth Stage

Leadership Team

Steve Hicks

President / CEO

Chris O'Neal

Sr Partner / VP Software Eng

Company data provided by Crunchbase