Nebius · 4 months ago

GPU Cluster Architect

United States

Full-time

Hybrid

Senior Level

5+ years exp

Nebius is leading a new era in cloud computing to serve the global AI economy, creating tools and resources for customers to tackle real-world challenges. The GPU Cluster Architect will be responsible for designing next-generation AI infrastructure, making architectural decisions across compute, networking, and storage to meet the demands of modern AI workloads.

AI Infrastructure · Cloud Infrastructure · GPU · IaaS · PaaS

Growth Opportunities

Responsibilities

Cluster Design: Architect scalable GPU cluster topologies including compute nodes, interconnect (InfiniBand, Ethernet), storage, and control planes

Performance Modeling: Analyze AI/ML workloads (e.g., LLM training, inference) to inform design tradeoffs across latency, bandwidth, and GPU density

Network Architecture: Align with the network architect on the relevant design, and validate low-latency, high-throughput interconnects (e.g., InfiniBand HDR/NDR, RoCEv2) at POD and DC scale

Storage Integration: Work with storage teams to optimize performance for training datasets, checkpointing, and related workloads

Reliability & Monitoring: Analyze signals from monitoring systems to detect flaws in the design

Collaboration: Partner with site reliability, networking, storage, and DC engineering teams to operationalize and scale your architecture

Qualifications

GPU architecture · HPC interconnects · Cluster design · Systems architecture · Scripting for automation

Required

5+ years of experience designing clusters

Deep understanding of modern GPU architecture (NVIDIA, AMD, etc.)

Experience with HPC interconnects (InfiniBand & RoCE)

Solid background in systems architecture, networking, and hardware reliability

Experience in scripting for automation and telemetry pipelines (Python, Go, etc.)

Benefits

Competitive salary and comprehensive benefits package.

Opportunities for professional growth within Nebius.

Hybrid working arrangements.

A dynamic and collaborative work environment that values initiative and innovation.

Company

Nebius

The Nebius AI Cloud brings powerful full-stack infrastructure for AI developers and practitioners across startups, enterprises and science institutes to build and deploy generative AI applications and rapidly deliver scientific breakthroughs by training and running ML models within a secure, high-performance, and cost-optimized cloud environment.

Founded in 2022

Amsterdam, Noord-Holland, NLD

501-1000 employees