
Sciforium · 3 weeks ago

Senior HPC & GPU Infrastructure Engineer

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. They are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of their GPU compute cluster, bridging hardware operations and machine learning workflows. The role involves hands-on systems engineering, cluster monitoring, and maintaining the ML software stack.

Artificial Intelligence (AI)

Responsibilities

On-Call Response: Act as the primary responder for system outages, GPU failures, node crashes, and cluster-wide incidents. Minimize downtime by resolving issues rapidly
Cluster Monitoring: Implement and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and overall system load
Vendor Liaison: Coordinate with data center staff, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster
OS Management: Install, patch, and maintain Linux distributions (Ubuntu / CentOS / RHEL). Ensure consistent configuration, kernel tuning, and automation for large node fleets
Security & Access Controls: Configure VPNs, iptables/firewalls, SSH hardening, and network routing to secure our compute infrastructure
Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity, and administer distributed file systems such as NFS, GPFS, or Lustre
Deployment & Bring-Up: Lead deployment of new GPU nodes, including BIOS configuration, NUMA tuning, GPU topology validation, and cluster integration
Driver & Kernel Management: Build and optimize kernel modules, maintain GPU drivers and runtime stacks for both NVIDIA (CUDA) and AMD (ROCm)
Software Stack Maintenance: Maintain and optimize ML frameworks and libraries, including PyTorch, JAX, the CUDA toolkit, cuDNN, ROCm, NCCL, and supporting runtime systems
Advanced Debugging: Troubleshoot complex interactions involving GPUs, compilers, ML frameworks, and distributed training runtimes (e.g., vLLM compilation failures, CUDA memory leaks, ROCm kernel crashes)

Qualifications

HPC experience, GPU cluster operations, Linux systems engineering, NVIDIA/AMD GPU expertise, Network security, Bash scripting, Python scripting, ML software stacks, Cluster monitoring, Advanced debugging, Configuration management

Required

5+ years of experience in HPC, GPU cluster operations, Linux systems engineering, or similar roles
Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field
Strong expertise with NVIDIA (H100/B200) or AMD (MI325X/MI355X) GPUs, including driver and kernel-level debugging
Deep understanding of Linux internals, kernel modules, hardware bring-up, and systems performance tuning
Experience with network security, including VPNs, iptables/firewalld, SSH, and identity management (LDAP/FreeIPA/AD)
Proficiency in Bash and Python for scripting, automation, and workflow tooling
Familiarity with ML software stacks: CUDA toolkit, cuDNN, NCCL, ROCm, JAX/PyTorch runtime behavior
Deep debugging experience with NVLink/NVSwitch fabrics and RDMA networking

Preferred

Experience with job schedulers such as Slurm, Kubernetes, or Run:AI
Exposure to vLLM, model serving optimizations, or inference systems
Hands-on experience with configuration management tools (Ansible, SaltStack, Terraform)
Previous experience supporting ML research teams in a startup or research-heavy environment

Benefits

Medical, dental, and vision insurance
401k plan
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity

Company

Sciforium

Sciforium builds the next generation of AI models with unprecedented efficiency, privacy, and versatility.

Funding

Current Stage: Early Stage
Total Funding: $15.9M
2025-10-27 · Seed · $12M
2024-06-01 · Pre-Seed · $3.9M
Company data provided by Crunchbase