Sciforium · 1 day ago

Senior HPC & GPU Infrastructure Engineer

San Francisco, CA

Full-time

Onsite

Senior Level

$190K/yr - $250K/yr

5+ years exp

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. They are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of their GPU compute cluster, ensuring optimal operation and maintenance of the ML software stack.

Artificial Intelligence (AI)

Responsibilities

System Health & Reliability (SRE)

On-Call Response: Act as the primary responder for system outages, GPU failures, node crashes, and cluster-wide incidents. Minimize downtime by resolving issues rapidly

Cluster Monitoring: Implement and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and overall system load

Vendor Liaison: Coordinate with data center staff, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster

Linux & Network Administration

OS Management: Install, patch, and maintain Linux distributions (Ubuntu / CentOS / RHEL). Ensure consistent configuration, kernel tuning, and automation for large node fleets

Security & Access Controls: Configure VPNs, iptables/firewalls, SSH hardening, and network routing to secure our compute infrastructure

Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity, and administer distributed file systems such as NFS, GPFS, or Lustre

GPU & ML Stack Engineering

Deployment & Bring-Up: Lead deployment of new GPU nodes, including BIOS configuration, NUMA tuning, GPU topology validation, and cluster integration

Driver & Kernel Management: Build and optimize kernel modules, and maintain GPU drivers and runtime stacks for both NVIDIA (CUDA) and AMD (ROCm)

Software Stack Maintenance: Maintain and optimize ML frameworks and libraries, including PyTorch, JAX, CUDA toolkit, cuDNN, ROCm, NCCL, and supporting runtime systems

Advanced Debugging: Troubleshoot complex interactions involving GPUs, compilers, ML frameworks, and distributed training runtimes (e.g., vLLM compilation failures, CUDA memory leaks, ROCm kernel crashes)

Qualifications

HPC experience · GPU cluster operations · Linux systems engineering · NVIDIA/AMD GPU expertise · Network security · Bash scripting · Python scripting · ML software stacks · Soft skills

Required

5+ years of experience in HPC, GPU cluster operations, Linux systems engineering, or similar roles

Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field

Strong expertise with NVIDIA (H100/B200) or AMD (MI325X/MI355X) GPUs, including driver and kernel-level debugging

Deep understanding of Linux internals, kernel modules, hardware bring-up, and systems performance tuning

Experience with network security, including VPNs, iptables/firewalld, SSH, and identity management (LDAP/FreeIPA/AD)

Proficiency in Bash and Python for scripting, automation, and workflow tooling

Familiarity with ML software stacks: CUDA toolkit, cuDNN, NCCL, ROCm, JAX/PyTorch runtime behavior

Deep debugging experience with NVLink/NVSwitch fabrics and RDMA networking

Preferred

Experience with job schedulers such as Slurm, Kubernetes, or Run:AI

Exposure to vLLM, model serving optimizations, or inference systems

Hands-on experience with configuration management tools (Ansible, SaltStack, Terraform)

Previous experience supporting ML research teams in a startup or research-heavy environment

Benefits

Medical, dental, and vision insurance

401(k) plan

Daily lunch, snacks, and beverages

Flexible time off

Competitive salary and equity

Company

Sciforium

Sciforium builds the next generation of AI models with unprecedented efficiency, privacy, and versatility.

Founded in 2024

San Francisco, California, USA

2-10 employees

https://sciforium.com

Funding

Current Stage

Early Stage

Total Funding

$15.9M

2025-10-27 · Seed · $12M

2024-06-01 · Pre-Seed · $3.9M

Company data provided by Crunchbase