SageCor Solutions · 3 weeks ago
Senior HPC Engineer/Administrator (IMC - 001)
SageCor Solutions is a growing company specializing in engineering services and high-performance computing (HPC). The Senior HPC Engineer/Administrator will be responsible for managing and providing technical support for HPC systems, ensuring system integrity, and optimizing operations in a research-driven environment.
Hardware · Information Technology · Software
Responsibilities
Configure and manage Linux and Windows (or other applicable) operating systems; install operating system software; troubleshoot, configure, and maintain the integrity of network components; and implement operating system enhancements to improve security, reliability, and performance
Administer, monitor, and maintain HPC systems, including compute nodes, storage, networking, and software stacks
Provide support to IT systems, including day-to-day operations, monitoring, and problem resolution for all client, server, storage, network, and mobile devices, etc.
Implement and maintain automation tools for system provisioning, configuration management, and monitoring
Provide support for implementation, troubleshooting and maintenance of IT systems
Manage the daily activities of configuration and operation of IT systems
Provide assistance to users in accessing and using IT systems
Optimize system operations and resource utilization, and perform system capacity analysis and planning
Apply in-depth experience in troubleshooting IT systems
Analyze and resolve complex problems associated with server hardware, applications and software integration
Contribute to performance benchmarking, system tuning, and capacity planning
Support researchers by providing technical expertise and resolving IT-related roadblocks or issues
Document system administration procedures and contribute to knowledge-sharing initiatives
Qualifications
Required
Active TS/SCI with polygraph required
Experience administering Linux-based servers and HPC clusters, including job schedulers (e.g., Slurm, LSF, PBS)
Experience configuring and managing Virtual Private Network (VPN) clients and servers
Scripting/programming skills (C and Python)
Knowledge of: System automation tools (e.g., Ansible)
Knowledge of: System provisioning tools (e.g., Warewolf)
Knowledge of: Distributed storage systems (e.g., Lustre, BeeGFS)
Knowledge of: Containerization (e.g., Docker, Apptainer)
Knowledge of: Installing, maintaining and using infrastructure and performance monitoring and optimization tools (e.g., Grafana, Prometheus)
Knowledge of: Setting up and executing benchmarks in an HPC environment and analyzing their results systematically
Preferred
Meets DoD 8140.01 or DoD 8570.01-M training and certification requirements