Runpod · 8 hours ago
Manager, Datacenter Network Engineering
Runpod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full‑stack AI applications. The Engineering Manager, Datacenter Network Engineering will lead a team responsible for designing, deploying, and operating Runpod's global datacenter and backbone network, focusing on network architecture and team leadership.
AI Infrastructure, Artificial Intelligence (AI), Cloud Infrastructure, GPU
Responsibilities
Lead the Datacenter Networking Team: Manage and grow a team of network engineers responsible for datacenter fabrics, interconnects, and global WAN connectivity. Provide mentorship, technical guidance, and clear ownership boundaries
Own Datacenter Network Architecture: Define and evolve network designs for GPU-heavy clusters, including spine-leaf topologies, ECMP routing, and high-bandwidth east-west traffic patterns
High-Performance GPU Networking: Oversee design and operation of InfiniBand and RoCE-based fabrics supporting distributed training and inference workloads. Ensure performance, loss characteristics, and congestion control meet AI workload requirements
Encapsulation & Overlay Protocols: Guide implementation and operations of encapsulation technologies such as VXLAN, EVPN, Geneve, or similar, enabling scalable multi-tenant isolation and flexible network provisioning
Global WAN & Backbone Connectivity: Lead strategy and execution for global WAN connectivity, including private backbone links, IX connectivity, and hybrid connectivity with cloud providers and partners
Reliability & Operations: Establish operational best practices for monitoring, capacity planning, change management, incident response, and post-mortems across the network stack
Cross-Functional Collaboration: Partner closely with Infrastructure, SRE, Hardware, and Product Engineering teams to ensure network capabilities align with platform and customer requirements
Vendor & Partner Management: Work with hardware vendors, colocation providers, and transit partners on network design, procurement, deployment timelines, and escalations
Security & Segmentation: Ensure network designs support secure isolation, DDoS resilience, and compliance requirements without compromising performance
Qualifications
Required
Engineering Leadership Experience: 3+ years managing network or infrastructure engineering teams, with experience scaling teams and systems in production environments
Datacenter Networking Expertise: 8+ years designing and operating large-scale datacenter networks, including spine-leaf architectures, BGP-based routing, and high-throughput fabrics
Encapsulation & Overlays: Strong hands-on experience with VXLAN/EVPN or equivalent encapsulation protocols, including control-plane and data-plane considerations
High-Performance Networking: Proven experience with InfiniBand and/or RoCE, including congestion management, lossless Ethernet concepts, and performance tuning for GPU workloads
Global WAN Experience: Deep familiarity with global WAN technologies, including private backbone design, inter-region connectivity, routing policy, and traffic engineering
Linux & Network OS Fluency: Comfortable working with Linux-based systems, network operating systems, and automation tooling
Operational Excellence: Strong background in network observability, incident management, capacity forecasting, and change control
Communication & Leadership: Clear written and verbal communication skills, with the ability to align stakeholders and lead teams through complex technical challenges
Successful completion of a background check
Preferred
Experience operating networks for GPU clusters, HPC environments, or AI/ML platforms
Familiarity with RDMA tuning, NCCL traffic patterns, and distributed training communication models
Experience with automation frameworks and network-as-code (e.g., Terraform, Ansible, internal tooling)
Background in multi-region or multi-cloud networking architectures
Experience working in high-growth or hyperscale infrastructure environments
Benefits
Meaningful equity in a fast-growing company: everyone on the team receives stock options. Your impact drives our growth, and you share in the upside.
Generous medical, dental & vision plans: we cover 100% for all employees and partial coverage for dependents.
Flexible PTO: take the time you need to recharge.
Company
Runpod
Runpod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications.
Funding
Current Stage: Growth Stage
Total Funding: $22M
Key Investors: Dell Technologies Capital, Intel Capital
2024-05-08 · Seed · $20M
2023-03-30 · Pre Seed · $2M
Recent News
Bizjournals.com · 2026-02-05
PR Newswire · 2026-01-20
Company data provided by Crunchbase