Nebius · 7 hours ago
HPC Specialist Solutions Architect
Nebius is leading a new era in cloud computing to serve the global AI economy, creating tools and resources for customers to solve real-world challenges. The HPC Specialist Solutions Architect role focuses on designing, building, and optimizing high-performance computing platforms for AI and large-scale data processing workloads.
AI Infrastructure, Cloud Infrastructure, GPU, IaaS, PaaS
Responsibilities
Architect and implement scalable HPC clusters optimized for AI, simulation, and distributed training, leveraging container orchestration frameworks and schedulers (e.g., Kubernetes, Slurm)
Design and integrate GPU-accelerated compute infrastructures featuring NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects
Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components
Design and validate cloud HPC environments, focusing on low-latency, high-bandwidth networking, multi-GPU scaling, and efficient workload scheduling
Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using modern observability and CI/CD tooling
Collaborate with hardware vendors (e.g., NVIDIA) and cloud providers to evaluate and optimize emerging HPC and GPU technologies
Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers
Provide expert-level technical guidance to customers, internal teams, and partners on HPC architecture patterns, and support operational excellence reviews and customer engagements
Qualifications
Required
Bachelor's or Master's degree in Computer Science, Engineering, or a related field (Ph.D. a plus)
3+ years of hands-on experience architecting HPC or large-scale GPU clusters
Expertise in Linux systems, Kubernetes, container runtimes (containerd, CRI-O, Docker), and related CI/CD practices
Strong understanding of HPC networking protocols and RDMA stacks (InfiniBand, NVLink/NVSwitch)
Deep understanding of storage and I/O optimization for large datasets (Ceph, Lustre, NFS, GPUDirect Storage)
Familiarity with Terraform, Ansible, Helm, and GitOps workflows
Strong scripting skills in Python or Bash for automation and tool integration
Excellent communication and documentation skills; ability to lead design reviews and customer engagements
Preferred
Proficiency with the NVIDIA GPU ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, and CUDA stack management
Experience designing or managing AI/ML pipelines via MLflow, Kubeflow, NeMo, or similar frameworks
Experience with cloud-native HPC offerings (Slurm, LSF, PBS, etc.)
Background in designing multi-tenant GPU infrastructures or AI training farms
Exposure to distributed ML frameworks (PyTorch DDP, DeepSpeed, Megatron)
Knowledge of observability for HPC (Prometheus, DCGM Exporter, Grafana, NVIDIA NGC monitoring tools)
Contribution to open-source HPC/CUDA/Kubernetes projects is a strong plus
Benefits
Health Insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
401(k) Plan: Up to 4% company match with immediate vesting.
Parental Leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
Remote Work Reimbursement: Up to $85/month for mobile and internet.
Disability & Life Insurance: Company-paid short-term, long-term, and life insurance coverage.
Company
Nebius
The Nebius AI Cloud brings powerful full-stack infrastructure for AI developers and practitioners across startups, enterprises and science institutes to build and deploy generative AI applications and rapidly deliver scientific breakthroughs by training and running ML models within a secure, high-performance, and cost-optimized cloud environment.
Funding
Current Stage
Late Stage
Total Funding
$1.04B
2025-06-04 · Debt Financing · $1B
2025-05-15 · Grant · $45M
2024-12-02 · Seed
Company data provided by Crunchbase