Doghouse Recruitment · 16 hours ago
AI/ML Solutions Architect
Doghouse Recruitment is seeking an AI/ML Solutions Architect to join their fast-moving AI infrastructure team focused on large-scale ML workloads. The role involves designing and validating production-grade distributed training architectures and collaborating closely with clients to optimize ML workloads across multi-node GPU environments.
Responsibilities
Design and validate production-grade distributed training (primary) and large-scale inference architectures on large GPU clusters, typically tens to thousands of GPUs
Work hands-on with customers to debug, optimize, and scale ML workloads across multi-node GPU environments
Act as a technical authority on GPU performance, networking, and schedulers, making trade-offs at scale and translating customer needs into concrete platform requirements
Collaborate closely with engineering, product, and R&D to influence roadmap decisions based on real-world ML workloads
This is a hands-on, technical role; you are expected to work directly in customer environments, not only advise at a high level
Qualifications
Required
Hands-on experience designing and operating production-grade, multi-node GPU workloads for training or inference
Strong background in distributed deep learning (PyTorch Distributed, DeepSpeed) on GPU clusters
Deep understanding of GPU architecture and interconnects (H100/A100 class, NVLink, InfiniBand)
Experience with Kubernetes or Slurm and performance tuning using GPU profiling and monitoring tools
Company
Doghouse Recruitment
Recruitment for your technology teams. You don't need another agency flooding your inbox with mismatched candidates.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase