Nebius · 7 hours ago

HPC Specialist Solutions Architect

United States

Full-time

Remote

Mid, Senior Level

$225K/yr - $315K/yr

3+ years exp

Nebius is leading a new era in cloud computing to serve the global AI economy, creating tools and resources for customers to solve real-world challenges. The HPC Specialist Solutions Architect role focuses on designing, building, and optimizing high-performance computing platforms for AI and large-scale data processing workloads.

AI Infrastructure, Cloud Infrastructure, GPU, IaaS, PaaS

Growth Opportunities

Responsibilities

Architect and implement scalable HPC clusters optimized for AI, simulation, and distributed training, leveraging container orchestration frameworks and schedulers (e.g., Kubernetes, Slurm)

Design and integrate GPU-accelerated compute infrastructures featuring NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects

Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components

Design and validate cloud HPC environments, focusing on low-latency, high-bandwidth networking, multi-GPU scaling, and efficient workload scheduling

Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using modern observability and CI/CD tooling

Collaborate with hardware vendors (e.g., NVIDIA) and cloud providers to evaluate and optimize emerging HPC and GPU technologies

Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers

Provide expert-level technical guidance to customers, internal teams, and partners on HPC architecture patterns, operational excellence reviews, and customer engagements

Qualifications

HPC architecture, GPU cluster design, Kubernetes orchestration, NVIDIA GPU technologies, Linux systems, CI/CD practices, Networking protocols, Storage optimization, Terraform, Ansible, Python scripting, Communication

Required

Bachelor's or Master's degree in Computer Science, Engineering, or a related field (Ph.D. a plus)

3+ years of hands-on experience architecting HPC or large-scale GPU clusters

Expertise in Linux systems, Kubernetes, container runtimes (containerd, CRI-O, Docker), and related CI/CD practices

Strong understanding of HPC networking protocols and RDMA stacks (InfiniBand, NVLink/NVSwitch)

Deep understanding of storage and I/O optimization for large datasets (Ceph, Lustre, NFS, GPUDirect Storage)

Familiarity with Terraform, Ansible, Helm, and GitOps workflows

Strong scripting skills in Python or Bash for automation and tool integration

Excellent communication and documentation skills; ability to lead design reviews and customer engagements

Preferred

Proficient with NVIDIA GPU ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, and CUDA stack management

Experience designing or managing AI/ML pipelines via MLflow, Kubeflow, NeMo, or similar frameworks

Experience with cloud-native HPC offerings (Slurm, LSF, PBS, etc.)

Background in designing multi-tenant GPU infrastructures or AI training farms

Exposure to distributed ML frameworks (PyTorch DDP, DeepSpeed, Megatron)

Knowledge of observability for HPC (Prometheus, DCGM Exporter, Grafana, NVIDIA NGC monitoring tools)

Contribution to open-source HPC/CUDA/Kubernetes projects is a strong plus

Benefits

Health Insurance: 100% company-paid medical, dental, and vision coverage for employees and families.

401(k) Plan: Up to 4% company match with immediate vesting.

Parental Leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.

Remote Work Reimbursement: Up to $85/month for mobile and internet.

Disability & Life Insurance: Company-paid short-term, long-term, and life insurance coverage.

Company

Nebius

The Nebius AI Cloud brings powerful full-stack infrastructure for AI developers and practitioners across startups, enterprises and science institutes to build and deploy generative AI applications and rapidly deliver scientific breakthroughs by training and running ML models within a secure, high-performance, and cost-optimized cloud environment.

Founded in 2022

Amsterdam, Noord-Holland, NLD

501-1000 employees

https://nebius.com/

Funding

Current Stage

Late Stage

Total Funding

$1.04B

2025-06-04 · Debt Financing · $1B

2025-05-15 · Grant · $45M

2024-12-02 · Seed

Leadership Team

Evan Helda

Head of Physical AI

Vinita Ananth

Sr. Director of Product

Recent News

Business Wire

Nebius Debuts the Robotics & Physical AI Awards and Summit to Support Next-Generation Startups With $1.5 Million in Compute Credits

2025-12-10

GeekWire

Tech Moves: Expedia names first AI chief; Textio founder joins Microsoft; T-Mobile exec departs

2025-12-02

WebProNews

Applied Digital’s $5B AI Lease: North Dakota’s Compute Boom

2025-10-25

Company data provided by Crunchbase