Systems Analyst (/Site Reliability Engineer) jobs in United States

Hewlett Packard Enterprise · 3 months ago

Systems Analyst (/Site Reliability Engineer)

Hewlett Packard Enterprise is a global edge-to-cloud company that helps organizations manage their data and applications. They are seeking a skilled Systems Analyst (/Site Reliability Engineer) to support Oak Ridge National Laboratory, focusing on the deployment, maintenance, and optimization of high-performance computing systems for scientific research.

Data Center · Enterprise Software · Information Technology · IT Management · Network Security
No H1B · U.S. Citizen Only

Responsibilities

Maintain and optimize compute infrastructure across multiple large-scale HPC systems
Participate in the deployment, testing, and validation of live high-performance computing clusters
Troubleshoot node failures by analyzing OS internals, compiler behavior, and system logs, coordinating with internal subject-matter experts as needed
Conduct routine and on-demand maintenance, troubleshooting, and performance tuning for large-scale HPC environments
Collaborate with researchers, engineers, and technical staff to open, maintain and close JIRA tickets to ensure system reliability and efficiency for high-stakes, high-performance scientific research
Investigate and document complex software and system-level issues, acting as a bridge between users and HPE internal teams
Develop and implement automation tools, scripts, and monitoring solutions to streamline system management
Stay up-to-date with advancements in HPC technologies, including GPU acceleration (e.g., ROCm), parallel computation (Cray PE, MPI/OpenMP), and performance tuning

Qualifications

HPC systems experience · Linux proficiency · Python programming · Bash scripting · C++/Fortran familiarity · Log analysis · Application build knowledge · Industry knowledge · Communication skills

Required

Due to the nature of the work, this position requires either U.S. Citizenship or U.S. Lawful Permanent Resident (LPR) status
Experience using SLURM-based HPC systems, both as a user and preferably as a system administrator
Proficient in Linux, Python, and Bash scripting
Familiarity with C++/Fortran-based HPC application development, GPUs, MPI, and high-performance computing tools
Strong understanding of application build processes, including compiler configurations, library integration, and dependency management, to effectively set up environments, perform upgrades, and troubleshoot build and runtime issues
Experience in large-scale log analysis and in troubleshooting performance issues, bugs, and system failures
Strong written and verbal communication skills, with the ability to document and share knowledge effectively with internal teams and end-users
Familiarity with emerging HPC trends, system architectures, and optimization strategies
Bachelor's in Computer Science, Computer Engineering, or a related field, with at least 2 years of experience, OR a Master's in Computer Science, Computer Engineering, or a related field

Benefits

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Company

Hewlett Packard Enterprise

Hewlett Packard Enterprise is an edge-to-cloud company that uses comprehensive solutions to accelerate business outcomes.

Funding

Current Stage
Public Company
Total Funding
$2.85B
Key Investors
Elliott Management Corp.
2025-04-15 · Post-IPO Equity · $1.5B
2024-09-10 · Post-IPO Equity · $1.35B
2015-11-02 · IPO

Leadership Team

Antonio Neri
President & CEO
Fidelma Russo
EVP & GM, Hybrid Cloud and Chief Technology Officer
Company data provided by Crunchbase