Doghouse Recruitment · 5 hours ago
HPC Infrastructure Solutions Architect
Doghouse Recruitment is seeking an HPC Infrastructure Solutions Architect to join their AI infrastructure team, focusing on building GPU, networking, and storage platforms for large-scale AI training workloads. The role involves designing and operating production-grade GPU and HPC platforms, ensuring quality, scalability, and efficiency of the infrastructure.
Responsibilities
Design and operate production-grade GPU and HPC platforms for AI/ML training and simulation
Build and scale GPU clusters, with a strong focus on Slurm-based scheduling
Design and optimize high-performance networking using RDMA, InfiniBand, NVLink, and NVSwitch
Design and tune storage and I/O paths for large-scale datasets
Build cloud infrastructure using open-source tooling such as Kubernetes, Terraform, and Helm
Qualifications
Required
Hands-on experience building and operating GPU or HPC clusters
Strong Linux, Kubernetes, networking, and storage background
Deep understanding of HPC networking and RDMA stacks
Experience with GPU schedulers, preferably Slurm
Strong cloud experience, ideally multi-cloud
Strong storage and I/O expertise
Preferred
Experience with specific storage technologies
Company
Doghouse Recruitment
Recruitment for your technology teams. You don't need another agency flooding your inbox with mismatched candidates.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase