Senior HPC & GPU Infrastructure Engineer jobs in United States

Sciforium · 1 day ago

Senior HPC & GPU Infrastructure Engineer

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. They are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of their GPU compute cluster, ensuring optimal operation and maintenance of the ML software stack.

Artificial Intelligence (AI)

Responsibilities

System Health & Reliability (SRE)
On-Call Response: Act as the primary responder for system outages, GPU failures, node crashes, and cluster-wide incidents. Minimize downtime by resolving issues rapidly.
Cluster Monitoring: Implement and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and overall system load
Vendor Liaison: Coordinate with data center staff, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster
Linux & Network Administration
OS Management: Install, patch, and maintain Linux distributions (Ubuntu / CentOS / RHEL). Ensure consistent configuration, kernel tuning, and automation for large node fleets.
Security & Access Controls: Configure VPNs, iptables/firewalls, SSH hardening, and network routing to secure our compute infrastructure
Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity, and administer distributed file systems such as NFS, GPFS, or Lustre
GPU & ML Stack Engineering
Deployment & Bring-Up: Lead deployment of new GPU nodes, including BIOS configuration, NUMA tuning, GPU topology validation, and cluster integration
Driver & Kernel Management: Build and optimize kernel modules, maintain GPU drivers and runtime stacks for both NVIDIA (CUDA) and AMD (ROCm)
Software Stack Maintenance: Maintain and optimize ML frameworks and libraries, including PyTorch, JAX, CUDA toolkit, cuDNN, ROCm, NCCL, and supporting runtime systems
Advanced Debugging: Troubleshoot complex interactions involving GPUs, compilers, ML frameworks, and distributed training runtimes (e.g., vLLM compilation failures, CUDA memory leaks, ROCm kernel crashes)

Qualifications

HPC experience · GPU cluster operations · Linux systems engineering · NVIDIA/AMD GPU expertise · Network security · Bash scripting · Python scripting · ML software stacks · Soft skills

Required

5+ years of experience in HPC, GPU cluster operations, Linux systems engineering, or similar roles
Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field
Strong expertise with NVIDIA (H100/B200) or AMD (MI325X/MI355X) GPUs, including driver and kernel-level debugging
Deep understanding of Linux internals, kernel modules, hardware bring-up, and systems performance tuning
Experience with network security, including VPNs, iptables/firewalld, SSH, and identity management (LDAP/FreeIPA/AD)
Proficiency in Bash and Python for scripting, automation, and workflow tooling
Familiarity with ML software stacks: CUDA toolkit, cuDNN, NCCL, ROCm, and JAX/PyTorch runtime behavior
Deep debugging experience with NVLink/NVSwitch fabrics and RDMA networking

Preferred

Experience with job schedulers such as Slurm, Kubernetes, or Run:AI
Exposure to vLLM, model serving optimizations, or inference systems
Hands-on experience with configuration management tools (Ansible, SaltStack, Terraform)
Previous experience supporting ML research teams in a startup or research-heavy environment

Benefits

Medical, dental, and vision insurance
401(k) plan
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity

Company

Sciforium

Sciforium builds the next generation of AI models with unprecedented efficiency, privacy, and versatility.

Funding

Current Stage
Early Stage
Total Funding
$15.9M
2025-10-27 · Seed · $12M
2024-06-01 · Pre-Seed · $3.9M
Company data provided by Crunchbase