This job has closed.

Berkeley Lab · 4 weeks ago

Senior HPC Cluster Systems Administrator

Berkeley Lab’s Information Technology Division is seeking a Senior HPC Cluster Systems Administrator to support the research community by maintaining Linux-based resources and high-performance computing cluster systems. The role involves extensive expertise in High Performance Computing infrastructure to facilitate groundbreaking research globally.

Research
No H1B

Responsibilities

Perform Linux system and HPC cluster maintenance and installations, operating system upgrades, system security hardening and intrusion detection, storage and file system management, system hardware, customization of user group working environment, troubleshooting, network monitoring, and crash recovery
Design, deploy, and manage scalable applications using Kubernetes, ensuring the availability, performance, and readiness of the Kubernetes infrastructure
Automate deployment, scaling, and management of containerized applications, and collaborate with DevOps and development teams to streamline CI/CD pipelines
Design, deploy, and manage the global storage platform to ensure high performance, massive scalability, reliability, and future-proof solutions
Support storage technologies, such as Lustre and VAST, and networks
Resolve I/O issues related to business applications, including diagnosing and resolving complex storage, Linux, and networking challenges in a fast-paced environment
Research new storage management technologies and techniques, and provide recommendations
Participate in developing system administration, security, and network policies, documentation, and tools oriented towards efficient systems management
Participate in cluster support to staff and researchers, including initial installation, integration, and ongoing maintenance of Linux High-Performance Computing cluster systems. This includes travel to remote sites as needed
Co-lead technical efforts with other senior system administrators in areas of HPC technologies such as job schedulers, high-performance interconnects, parallel file systems, cybersecurity, cluster management, container orchestration, VM infrastructure, networking, performance tuning, or data center planning
Co-lead group projects of small to medium size and complexity to implement and deploy new computing technologies and associated services to the research community

Qualifications

Linux system administration, High Performance Computing, Kubernetes management, Storage system design, Python, Bash, CI/CD tools, Red Hat derivatives, Interpersonal skills, Communication skills, Project management

Required

A Bachelor's Degree (or equivalent knowledge/training) in Computer Science, Engineering, or a related discipline, and a minimum of 12 years of relevant experience in Linux system administration within a large distributed computing environment, including experience providing systems and end-user support for multiple scientific or computational research groups, or an equivalent combination of education and experience
Demonstrated ability to manage large-scale, performance-critical environments, including capacity planning, scaling, and optimization
Significant experience deploying, scaling, and managing Kubernetes clusters, with a strong understanding of its architecture (pods, deployments, services, ingress) and container orchestration. Proven proficiency with CI/CD tools like Jenkins or GitLab CI
Proven experience with Red Hat derivatives (CentOS, Scientific Linux, Rocky Linux), Debian, Ubuntu, and large-scale system and configuration management tools (Kickstart, Ansible, Puppet, Chef, Warewulf). Expertise in supporting standard services (NFS, LDAP, SMB, MySQL, Apache/Nginx HTTPD)
Strong HPC expertise, including Linux, job schedulers, high-performance interconnects, parallel file systems, cybersecurity, container orchestration, cluster management, VM infrastructure, networking, performance tuning, scientific application support, and data center planning
Proficiency in Python and Bash for building, optimizing, and debugging scientific codes (C, C++, Fortran, Java), including experience with compilers (GCC, Intel), debuggers, Makefiles, and version control (Git, Subversion)
Expertise in storage system design and optimization (Lustre, S3, VAST, Weka, Ceph, DDN), including a deep understanding of the storage stack (kernel to user space, including file systems, block storage, I/O schedulers, VFS), storage benchmarking, and performance tuning (throughput, latency, IOPS, workload-specific optimizations)
Excellent oral and written communication skills, including experience organizing and presenting customer-focused technical data, reports, and projects to audiences with varying degrees of technical expertise
Strong interpersonal skills including experience with research facilitation and project management in a multidisciplinary team environment

Preferred

An Advanced Degree (or equivalent knowledge/training) in Computer Science, Engineering, or a related discipline
Experience with software engineering and/or software development
Familiarity with Kubernetes-related tools like Helm, Istio, and Prometheus
Demonstrated experience supporting research at a National Lab and/or in an academic or research environment

Benefits

Exceptional health and retirement benefits, including pension or 401(k)-style plans
Opportunities to grow in your career - check out our Tuition Assistance Program
A culture where you’ll belong - we are invested in our teams!
In addition to accruing vacation and sick time, we also have an annual Winter Holiday Shutdown
Parental bonding leave (for both mothers and fathers)
Pet insurance

Company

Berkeley Lab

Berkeley Lab is a national laboratory that creates advanced new tools for scientific discovery.

Funding

Current Stage
Late Stage

Leadership Team

Mary Barnum, MBA
Business Manager, COO Office
Rebecca Rishell
Deputy Chief Operating Officer
Company data provided by crunchbase