RDMA Ops Engineer - Computing Infrastructure Networking-Sunnyvale jobs in United States

Alibaba Cloud · 6 hours ago

RDMA Ops Engineer - Computing Infrastructure Networking-Sunnyvale

Alibaba Cloud is seeking a skilled RDMA Ops Engineer to optimize and maintain high-performance networking infrastructure for their computing clusters. This role focuses on building and operating ultra-low latency, high-throughput networks using RDMA technologies to power next-generation computing workloads.

Cloud Data Services · Cloud Management · Data Center · Data Management · Foundational AI · Software
H1B Sponsor Likely

Responsibilities

Deploy, operate, and maintain RDMA-based network architectures (RoCE/InfiniBand) for clusters with thousands of nodes
Optimize network performance for distributed collective communication workloads (NCCL, MPI, etc.)
Solve complex network issues in distributed collective communication (e.g., NCCL/MPI communication bottlenecks)
Use automation tools for network provisioning, monitoring, diagnostics, and network performance profiling (latency/throughput analysis)
Implement CI/CD pipelines for network infrastructure-as-code
Manage end-to-end network lifecycle: deployment, configuration, monitoring, upgrades
Collaborate with computing algorithm engineers to troubleshoot network-related bottlenecks in training/inference pipelines
Bridge Computing framework requirements with underlying network infrastructure capabilities
Ensure compliance with security and scalability requirements

Qualifications

RDMA operational experience · Linux network stack tuning · Network protocols knowledge · Scripting skills · Performance tuning · Kubernetes networking · Complex technical abstraction · Communication skills

Required

Strong scripting skills (Python/Go/Bash) for operational automation
Expert-level RDMA operational experience (RoCEv2/InfiniBand)
Understanding of Linux internals (kernel bypass, syscall optimization, etc.) and proficiency in Linux network stack tuning (irqbalance, NUMA, hugepages)
Hands-on experience with RDMA/DPDK performance tuning
Strong knowledge of network protocols (TCP/IP, RoCEv2) and NIC architecture principles
Ability to abstract complex technical concepts into architectural diagrams
Proven track record of translating R&D innovations into production solutions
Strong communication skills for cross-functional collaboration with Computing researchers and SRE teams

Preferred

Experience managing production Computing networks
Familiarity with Kubernetes networking (CNI, Multus, SR-IOV) and GPU-aware scheduling
Background in Computing system optimization (NVIDIA collective libraries, MPI tuning)
Deep understanding of Computing workload patterns and their network implications

Company

Alibaba Cloud

Alibaba Cloud develops cloud computing and data management services. It is a sub-organization of Alibaba Group.

H1B Sponsorship

Alibaba Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (18)
2024 (14)
2023 (2)
2022 (1)

Funding

Current Stage: Late Stage
Total Funding: $1.2B
Key Investors: Alibaba Group
2015-07-29 · Series B · $1B
2012-09-20 · Series A · $200M
Company data provided by Crunchbase