GPU Cluster Architect jobs in United States

Nebius · 4 months ago

GPU Cluster Architect

Nebius is leading a new era in cloud computing to serve the global AI economy, creating tools and resources for customers to tackle real-world challenges. The GPU Cluster Architect will be responsible for designing next-generation AI infrastructure, making architectural decisions across compute, networking, and storage to meet the demands of modern AI workloads.

AI Infrastructure, Cloud Infrastructure, GPU, IaaS, PaaS
Growth Opportunities

Responsibilities

Cluster Design: Architect scalable GPU cluster topologies including compute nodes, interconnect (InfiniBand, Ethernet), storage, and control planes
Performance Modeling: Analyze AI/ML workloads (e.g. LLM training, inference) to inform design tradeoffs across latency, bandwidth, and GPU density
Network Architecture: Work with the network architect to align on the relevant design, and validate low-latency, high-throughput interconnects (e.g., InfiniBand HDR/NDR, RoCEv2) at POD and DC scale
Storage Integration: Work with storage teams to optimize performance for training datasets, checkpointing, and related workloads
Reliability & Monitoring: Understand and analyze signals from monitoring systems to detect flaws in the design
Collaboration: Partner with site reliability, networking, storage, and DC engineering teams to operationalize and scale your architecture

Qualifications

GPU architecture, HPC interconnects, Cluster design, Systems architecture, Scripting for automation

Required

5+ years of experience designing clusters
Deep understanding of modern GPU architecture (NVIDIA, AMD, etc.)
Experience with HPC interconnects (InfiniBand & RoCE)
Solid background in systems architecture, networking, and hardware reliability
Experience in scripting for automation and telemetry pipelines (Python, Go, etc.)

Benefits

Competitive salary and comprehensive benefits package.
Opportunities for professional growth within Nebius.
Hybrid working arrangements.
A dynamic and collaborative work environment that values initiative and innovation.

Company

Nebius

The Nebius AI Cloud brings powerful full-stack infrastructure for AI developers and practitioners across startups, enterprises and science institutes to build and deploy generative AI applications and rapidly deliver scientific breakthroughs by training and running ML models within a secure, high-performance, and cost-optimized cloud environment.

Funding

Current Stage: Late Stage
Total Funding: $1.04B
2025-06-04 · Debt Financing · $1B
2025-05-15 · Grant · $45M
2024-12-02 · Seed

Leadership Team

Evan Helda
Head of Physical AI
Vinita Ananth
Sr. Director of Product
Company data provided by Crunchbase