
Santa Clara University · 7 hours ago

HPC System Administrator

Santa Clara University is seeking a High-Performance Computing (HPC) System Administrator responsible for the design, configuration, optimization, and operation of the organization's HPC infrastructure. The role involves advanced system optimization, complex troubleshooting, and mentoring existing system administrators to enhance the team's HPC expertise.

Education, Higher Education, Universities
H1B Sponsor Likely

Responsibilities

HPC Infrastructure Management and Optimization
Compute: Manages the entire lifecycle of all compute nodes, including procuring, installing, configuring, and maintaining hardware, operating systems, and core system software to ensure optimal performance, stability, and resource utilization for scientific workloads
Storage: Directs the management of the high-performance parallel file systems (e.g., Lustre, GPFS), NAS, and backup solutions, executing capacity planning, performance tuning, and integrity checks to guarantee secure, high-speed, and reliable data access for all users
InfiniBand: Designs, deploys, and provides expert-level troubleshooting and maintenance for the InfiniBand high-speed interconnect fabric, ensuring low-latency, high-bandwidth inter-node communication essential for scalable HPC application performance
Workload Management and System Deployment
Slurm: Administers, configures, and tunes the Slurm Workload Manager, actively managing job queues, partitions, and resource allocation policies to enforce fair-share scheduling, maximize cluster utilization, and meet diverse research computational needs
System Imaging: Develops, maintains, and updates standardized, optimized system images for all compute nodes, utilizing automation tools to facilitate rapid, consistent deployment, efficient patching, and streamlined upgrades across the cluster environment
Software Licenses: Oversees the administration and compliance of all commercial scientific software licenses, ensuring adherence to vendor agreements and strategically managing license servers and usage policies to optimize utilization and accessibility for the HPC user base
Team Development and Strategic Planning
Knowledge Transfer: Develops and implements a formal cross-training program for existing system administrators by creating documentation and delivering hands-on instruction to enhance the team's collective expertise in HPC-specific technologies (Slurm, InfiniBand, parallel file systems)
Operational Resilience: Ensures robust, shared support capabilities across the IT team by strategically transferring HPC knowledge, actively preventing single points of failure, and improving the overall efficiency and responsiveness of the operational support model
Strategic Enhancement: Contributes to the strategic planning and roadmap development for future HPC infrastructure and software enhancements by researching emerging technologies, evaluating vendor solutions, and providing expert recommendations to ensure the environment remains cutting-edge and meets long-term organizational goals
Coordination and Collaboration
Use broad expertise and unique skills to play an active role as a technical expert during the planning and implementation phases of new technologies, and participate in architecture brainstorming and design discussions with technical team members
Provide technical guidance on complex infrastructure architecture challenges to IS team members and other solution partners
Act as a role model for developing and trying different problem-solving approaches and supporting team members to do the same
Coach and develop new team members on how to provide the best customer service
Model openness and honesty, and support other team members in doing the same, to build positive relationships based on trust, predictability, and communication
Resource Planning
Provide input on setting Enterprise Systems and CIT goals, objectives, and strategies based on the University's mission, goals, and strategic plan
Provide input in technology planning processes to develop cost-effective customer-focused solutions
Use strong technical and organizational knowledge to plan and lead projects and working groups
Service Delivery
Work closely with the ES Manager in the creation, planning, maintenance, and secure expansion of SCU's computing infrastructure. This includes, but is not limited to, local and hosted servers, virtual appliances and devices, and storage
Work closely with ES Manager to ensure that architecture principles and standards are consistently applied across the data center compute and storage services
Collaborate with the Information Security Office (ISO) to ensure a secure and compliant enterprise environment
Work with the ISO to ensure that systems are secure and to plan for future security needs and threats
Ensure the appropriate distribution of infrastructure services to faculty, staff, and students
Create and document standards and practices regarding data center, compute and storage services for use across the University
Oversee the creation and performance of infrastructure production and test environments
Create scalable, interoperable, and flexible infrastructure solutions
Support assigned systems with on-call availability and respond within agreed upon timeframes
Analyze and evaluate processes to document and implement a standard routine and process for applying patches and updates to operating systems, applications, hardware, and firmware, ensuring all physical, virtual, and hosted systems are patched to the appropriate security and version level
Participate as necessary in backup operations, ensuring all required file systems and system data are successfully backed up to the appropriate media and are available off site
Participate in disaster recovery and business continuity planning
Perform daily system monitoring, verifying integrity and availability of all hardware, server resources, systems, and key processes. Check for potential problems, resource availability, capacity, performance and load characteristics, network integrity, and security threats. Monitor systems activity and usage to maintain a secure environment. Develop related solutions as warranted
Work with the CISO and system stakeholders to establish upgrade and update schedules, and maintenance windows
Keep abreast of software releases and updates, keeping all systems at current release levels as appropriate for the successful operation of the data center in support of the University
Serve as the liaison with hosted platform and third party providers to monitor service level agreements and ensure that performance expectations and requirements are met
Service Optimization
Enhance existing architecture frameworks in order to define, design, and implement simplified, standards based system architectures
Assist in the design, planning, and implementation of infrastructure systems optimization and process improvement projects
Test and assess existing infrastructure against industry standard internal and external benchmarks to ensure optimal performance and service delivery
Participate in IT and information security audits and prioritize corrective actions and successful remediation of areas supervised to ensure that continuous improvements are made on an ongoing basis
Participate in the change management process to ensure all changes to relevant services are documented, tested, deployed, and prepped for back-out strategies if necessary
Stay aware of industry trends and how to incorporate them into our infrastructure environment to improve services and/or cut costs
Communication
Effectively communicate complex data analyses to provide technical and strategic input during the planning phase of potential projects in the form of technical architecture designs and recommendations
Regularly communicate with Cyberinfrastructure Technologies colleagues regarding initiatives
Keep the ES Manager informed of current and potential issues, activities, operational outages, and any other risks that might jeopardize or degrade IT service delivery to the University community
Operations
Suggest operation strategies to accommodate major shifts in customer needs
Determine procedures and methods for operational tasks required to maintain data center servers and related systems in reliable, stable operation. This person will use their experience and judgment to plan and accomplish goals and objectives, and to identify potential problems and define/implement solutions
Support academic programs by providing the necessary expertise and technical support to make faculty and student technology adoption successful, e.g., consulting with faculty launching initiatives, identifying their needs, evaluating solutions, and implementing those solutions
Support compute and storage needs of institutional programs by providing the necessary expertise and technical support to build a robust and highly available solution to meet their needs
Empower end users to successfully use the technology
Interface with vendors and external resources
Evaluate new software or systems under consideration for adoption
Ensure asset management procedures are maintained and documented
Work with the Enterprise Systems team to automate and streamline procedures within the department
On occasion, work beyond traditional work schedules and hours. Required to carry a cell phone and be on call
Utilize technologies and tools to support the compute and storage infrastructure: programming, scripting, diagnostic tools
Other duties as assigned by the Manager of Enterprise Systems and IS leadership

Qualification

HPC Infrastructure Management, InfiniBand, Slurm, Parallel File Systems, Linux Administration, SAN Storage Management, Configuration Management, Scripting Languages, Cloud Providers, Customer Service, Project Management, Team Collaboration, Communication Skills, Problem Solving

Required

Bachelor's degree in a directly applicable field of study (Computer or Electrical Engineering, Math/Computer Science, Operations and Management Information Science)
8+ years of applicable experience in the operation, maintenance, support, and design of enterprise-wide computer center systems with demonstrated increasing responsibilities
2+ years of experience supporting an HPC environment, including experience with Slurm or a similar workload manager; InfiniBand or a similar high-speed interconnect; and Lustre or a similar parallel file system
Knowledge of information technology, campus technology, and information security issues and trends in higher education, and ability to continually develop new knowledge regarding the same
Ability to listen and understand customer needs
Ability to plan, implement, and evaluate customer service initiatives
Ability to work in a collaborative environment, as either a member or leader of a team, to meet deadlines and achieve goals
Ability to manage a diverse workforce to provide excellent customer service
Self-motivated and shows initiative
Ability to successfully manage multiple projects simultaneously
Proven track record in project planning and project management
Ability to exercise independent judgment and engage in critical thinking and problem solving
Ability to work effectively under pressure in a busy (sometimes chaotic) and demanding information services environment
Ability to explain technical issues and policies to non-technical people
Ability to give presentations on technical issues to a broad range of audiences
Ability to foster and maintain good working relationships with faculty, administrators, students, senior management, and other leaders
Ability to handle sensitive matters with diplomacy and the ability to mediate between competing parties
Ability to maintain confidentiality and manage confidential information
Must possess impeccable integrity
Ability to speak truth to power
Appreciation for the University's mission, vision, values, priorities, procedures, and policies
Knowledgeable and experienced in large-scale computer center operations with multiple systems running Linux and Windows Server operating systems
Experience with managing and operating SAN storage environments
Strong proficiency in the management of multi-platform hardware and software environments including Microsoft, Linux (Red Hat)
Experience with configuration management tools such as Ansible and Warewulf
Experience with Slurm and job scheduling
Strong proficiency with scripting languages (Python, Bourne Shell, Perl, etc.)
Experience with compiling software packages and managing software modules in an HPC environment (EasyBuild, Lmod)
Experience with racking servers and adding PCI cards
Experience with LDAP and DNS
Experience with parallel file systems
Experience with InfiniBand networks
Experience with vSphere, ESXi
Experience in using and configuring system monitoring tools
Experience with enterprise backups
Experience with cloud providers
Skilled technical troubleshooter. Must be able to analyze and solve complex problems
Knowledgeable in the use of a personal computer and standard productivity tools
Experience interacting and working with other people in a successful customer service capacity
Knowledge of industry trends in enterprise infrastructure and data center technology, including automation tools, cloud technology, disaster recovery, virtualization, networking, security, and other pertinent areas
Experience with Identity and Access Management (IAM)
Excellent interpersonal, written and verbal communication skills
Demonstrated ability to work in a collaborative, team environment
Strong organizational skills and ability to multitask
Must be a “self-starter” and show initiative to proactively identify and resolve problems
Must have the ability to acquire and apply new skills quickly
Strong customer service orientation
Understands the role of enterprise computing in University business processes
Works under limited supervision

Preferred

Advanced degree in a directly applicable field of study or a field of management
Experience working for the needs of Higher Education or research organizations is desirable

Company

Santa Clara University

Jesuit University in Silicon Valley

H1B Sponsorship

Santa Clara University has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data Powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship; the highlighted segment represents job fields similar to this job]
Trends of Total Sponsorships: 2021 (1)

Funding

Current Stage: Late Stage
Total Funding: $100M
Key Investors: National Institute for Innovation in Manufacturing Biopharmaceuticals, Knight Foundation
2024-04-09: Grant
2022-04-07: Grant
2017-01-22: Grant · $100M

Leadership Team

Tim Harris
Dean’s Executive Professor, Coord Contemporary Business
Company data provided by crunchbase