NVIDIA · 1 week ago
Senior Deep Learning Communication Architect
NVIDIA is a leader in computer graphics and AI innovation, seeking a Senior Deep Learning Communication Architect to enhance the performance and scalability of deep learning systems. The role involves optimizing communication protocols, collaborating with hardware and software teams, and exploring innovative technologies to improve distributed deep learning training and inference.
AI Infrastructure · Artificial Intelligence (AI) · Consumer Electronics · Foundational AI · GPU · Hardware · Software · Virtual Reality
Responsibilities
The software architecture group at NVIDIA has openings for a Deep Learning Communication Architect. We scale DNN models and training/inference frameworks to systems with hundreds of thousands of nodes.
Optimizing communication performance: Identify and eliminate bottlenecks in data transfer and synchronization during distributed deep learning training and inference
Designing efficient communication protocols: Develop and implement communication algorithms and protocols tailored for deep learning workloads, minimizing communication overhead and latency
Hardware and software co-design: Collaborate with hardware and software teams to design systems that effectively leverage high-speed interconnects (e.g., NVLink, InfiniBand, SPC-X) and communication libraries (e.g., MPI, NCCL, UCX, UCC, NVSHMEM); a minimal collective-communication sketch follows this list
Exploring innovative communication technologies: Research and evaluate new communication technologies and techniques to enhance the performance and scalability of deep learning systems
Developing and implementing solutions: Build proofs-of-concept, conduct experiments, and perform quantitative modeling to validate and deploy new communication strategies
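As context for the collectives this role tunes, here is a minimal sketch of a data-parallel gradient all-reduce using PyTorch's torch.distributed with the NCCL backend; the tensor size, script name, and launch command are illustrative assumptions, not part of the posting.

```python
# Minimal all-reduce sketch (illustrative; not from the posting).
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR/PORT, and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank holds its own "gradients"; all-reduce sums them across ranks,
    # which is the core communication step in data-parallel training.
    grads = torch.ones(1 << 20, device="cuda") * dist.get_rank()
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()  # average the summed gradients

    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()}, mean grad={grads[0].item():.3f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice, the work described above centers on how collectives like this (all-reduce, all-gather, reduce-scatter) are scheduled and overlapped with compute across NVLink and InfiniBand fabrics.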
Qualifications
Required
A Ph.D., Master's, or B.S. in Computer Science (CS), Electrical Engineering (EE), Computer Science and Electrical Engineering (CSEE), or a closely related field, or equivalent experience
6+ years of experience in building DNNs, scaling DNNs, parallelizing DNN frameworks, or deep learning training and inference workloads
Experience in evaluating, analyzing, and optimizing LLM training and inference performance of state-of-the-art models on cutting-edge hardware
Deep understanding of parallelism techniques, including Data Parallelism, Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, and FSDP (a short sharding sketch follows this list)
Understanding of emerging serving architectures such as disaggregated serving, and of inference servers such as Dynamo and Triton
Proficiency in developing code for one or more deep neural network (DNN) training and inference frameworks, such as PyTorch, TensorRT-LLM, vLLM, or SGLang
Strong programming skills in C++ and Python
Familiarity with GPU computing (CUDA, OpenCL) and with InfiniBand and RoCE networks
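To illustrate one of the parallelism techniques listed above, here is a minimal sketch, assuming PyTorch's FullyShardedDataParallel (FSDP) API, of sharding a toy model across ranks; the model shape, learning rate, and launch command are illustrative, not from the posting.

```python
# Minimal FSDP sketch (illustrative toy model; not from the posting).
# Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy two-layer model; FSDP shards its parameters across ranks and
    # issues all-gather / reduce-scatter collectives around compute.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
    model = FSDP(model)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()  # triggers reduce-scatter of sharded gradients
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```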
Preferred
Prior contributions to one or more DNN training and inference frameworks as part of your previous work experience
Deep understanding and contributions to the scaling of LLMs on large-scale systems
Benefits
Equity · Benefits
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. The figures below are provided for reference. (Data powered by the US Department of Labor.)
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage: Public Company
Total Funding: $4.09B
Key Investors: ARPA-E, ARK Investment Management, SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity
Recent News
The Motley Fool · 2026-01-12
Company data provided by Crunchbase