HPC Specialist Solutions Architect jobs in United States

Nebius · 7 hours ago

HPC Specialist Solutions Architect

Nebius is leading a new era in cloud computing to serve the global AI economy, creating tools and resources for customers to solve real-world challenges. The HPC Specialist Solutions Architect role focuses on designing, building, and optimizing high-performance computing platforms for AI and large-scale data processing workloads.

AI Infrastructure, Cloud Infrastructure, GPU, IaaS, PaaS
Growth Opportunities

Responsibilities

Architect and implement scalable HPC clusters optimized for AI, simulation, and distributed training, leveraging container orchestration frameworks and schedulers (e.g., Kubernetes, Slurm)
Design and integrate GPU-accelerated compute infrastructures featuring NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects
Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components
Design and validate cloud HPC environments, focusing on low-latency, high-bandwidth networking, multi-GPU scaling, and efficient workload scheduling
Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using modern observability and CI/CD tooling
Collaborate with hardware vendors (e.g., NVIDIA) and cloud providers to evaluate and optimize emerging HPC and GPU technologies
Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers
Provide expert-level technical guidance to customers, internal teams, and partners on HPC architecture patterns, operational excellence reviews, and customer engagements

Qualifications

HPC architecture, GPU cluster design, Kubernetes orchestration, NVIDIA GPU technologies, Linux systems, CI/CD practices, Networking protocols, Storage optimization, Terraform, Ansible, Python scripting, Communication

Required

Bachelor's or Master's degree in Computer Science, Engineering, or a related field (Ph.D. a plus)
3+ years of hands-on experience architecting HPC or large-scale GPU clusters
Expertise in Linux systems, Kubernetes, container runtimes (containerd, CRI-O, Docker), and related CI/CD practices
Strong understanding of HPC networking protocols and RDMA stacks (InfiniBand, NVLink/NVSwitch)
Deep understanding of storage and I/O optimization for large datasets (Ceph, Lustre, NFS, GPUDirect Storage)
Familiarity with Terraform, Ansible, Helm, and GitOps workflows
Strong scripting skills in Python or Bash for automation and tool integration
Excellent communication and documentation skills; ability to lead design reviews and customer engagements

Preferred

Proficiency with the NVIDIA GPU ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, and CUDA stack management
Experience designing or managing AI/ML pipelines via MLflow, Kubeflow, NeMo, or similar frameworks
Experience with cloud-native HPC offerings (Slurm, LSF, PBS, etc.)
Background in designing multi-tenant GPU infrastructures or AI training farms
Exposure to distributed ML frameworks (PyTorch DDP, DeepSpeed, Megatron)
Knowledge of observability for HPC (Prometheus, DCGM Exporter, Grafana, NVIDIA NGC monitoring tools)
Contribution to open-source HPC/CUDA/Kubernetes projects is a strong plus

Benefits

Health Insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
401(k) Plan: Up to 4% company match with immediate vesting.
Parental Leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
Remote Work Reimbursement: Up to $85/month for mobile and internet.
Disability & Life Insurance: Company-paid short-term, long-term, and life insurance coverage.

Company

Nebius

The Nebius AI Cloud brings powerful full-stack infrastructure for AI developers and practitioners across startups, enterprises and science institutes to build and deploy generative AI applications and rapidly deliver scientific breakthroughs by training and running ML models within a secure, high-performance, and cost-optimized cloud environment.

Funding

Current Stage: Late Stage
Total Funding: $1.04B
2025-06-04 · Debt Financing · $1B
2025-05-15 · Grant · $45M
2024-12-02 · Seed

Leadership Team

Evan Helda, Head of Physical AI
Vinita Ananth, Sr. Director of Product
Company data provided by Crunchbase