Nebius · 4 months ago
GPU Cluster Architect
Nebius is leading a new era in cloud computing to serve the global AI economy, creating tools and resources for customers to tackle real-world challenges. The GPU Cluster Architect will be responsible for designing next-generation AI infrastructure, making architectural decisions across compute, networking, and storage to meet the demands of modern AI workloads.
AI Infrastructure, Cloud Infrastructure, GPU, IaaS, PaaS
Responsibilities
Cluster Design: Architect scalable GPU cluster topologies including compute nodes, interconnect (InfiniBand, Ethernet), storage, and control planes
Performance Modeling: Analyze AI/ML workloads (e.g., LLM training, inference) to inform design tradeoffs across latency, bandwidth, and GPU density
Network Architecture: Work with the network architect to design and validate low-latency, high-throughput interconnects (e.g., InfiniBand HDR/NDR, RoCEv2) at POD and DC scale
Storage Integration: Work with storage teams to optimize performance for training datasets, checkpointing, and other workloads
Reliability & Monitoring: Understand and analyze signals from monitoring systems to detect flaws in the design
Collaboration: Partner with site reliability, networking, storage, and DC engineering teams to operationalize and scale your architecture
Qualifications
Required
5+ years of experience designing clusters
Deep understanding of modern GPU architecture (NVIDIA, AMD, etc.)
Experience with HPC interconnects (InfiniBand & RoCE)
Solid background in systems architecture, networking, and hardware reliability
Experience in scripting for automation and telemetry pipelines (Python, Go, etc.)
Benefits
Competitive salary and comprehensive benefits package.
Opportunities for professional growth within Nebius.
Hybrid working arrangements.
A dynamic and collaborative work environment that values initiative and innovation.
Company
Nebius
The Nebius AI Cloud brings powerful full-stack infrastructure for AI developers and practitioners across startups, enterprises and science institutes to build and deploy generative AI applications and rapidly deliver scientific breakthroughs by training and running ML models within a secure, high-performance, and cost-optimized cloud environment.
Funding
Current Stage
Late Stage
Total Funding
$1.04B
2025-06-04 · Debt Financing · $1B
2025-05-15 · Grant · $45M
2024-12-02 · Seed
Company data provided by crunchbase