Cornelis Networks · 2 months ago
AI Performance Engineer
Cornelis Networks delivers high-performance scale-out networking solutions for AI and HPC datacenters. They are seeking an AI Performance Engineer to optimize training and multi-node inference across advanced networking systems, collaborating with various teams to enhance performance for distributed AI workloads.
Artificial Intelligence (AI) · Information Technology · Software
Responsibilities
Own end-to-end performance for distributed AI workloads (training + multi-node inference) across multi-node clusters and diverse fabrics (Omni-Path, Ethernet, InfiniBand)
Benchmark, characterize, and tune open-source & industry workloads (e.g., Llama, Mixtral, diffusion, BERT/T5, MLPerf) on current and future compute, storage, and network hardware, including vLLM/TensorRT-LLM/Triton serving paths
Design and optimize distributed serving topologies (sharded/replicated, tensor/pipe parallel, MoE expert placement), continuous/adaptive batching, KV-cache sharding/offload (CPU/NVMe) & prefix caching, and token streaming with tight p99/p999 SLOs
Optimize inference paths: validate RDMA/GPUDirect RDMA, congestion control, and collective vs. point-to-point communication tradeoffs
Design experiment plans to isolate scaling bottlenecks (collectives, kernel hot spots, I/O, memory, topology) and deliver clear, actionable deltas with latency-SLO dashboards and queuing analysis
Build crisp proof points that compare Cornelis Omni-Path to competing interconnects; translate data into narratives for sales/marketing and lighthouse customers, including cost-per-token and tokens/sec-per-watt for serving
Instrument and visualize performance (Nsight Systems, ROCm/Omnitrace, VTune, perf, eBPF, RCCL/NCCL tracing, app timers) plus serving telemetry (Prometheus/Grafana, OpenTelemetry traces, concurrency/queue depth)
Evangelize best practices through briefs, READMEs, and conference-level presentations on distributed inference patterns and anti-patterns
Qualifications
Required
B.S. in CS/EE/CE/Math or related
5–7+ years running AI/ML at cluster scale
Proven ability to set up, run, and analyze AI benchmarks; deep intuition for message passing, collectives, scaling efficiency, and bottleneck hunting for both training and low-latency serving
Hands-on with distributed training beyond single-GPU (DP/TP/PP, ZeRO, FSDP, sharded optimizers) and distributed inference architectures (replicated vs sharded, tensor/KV parallel, MoE)
Practical experience across AI stacks & comms: PyTorch, DeepSpeed, Megatron-LM, PyTorch Lightning; RCCL/NCCL, MPI/Horovod; Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, KServe
Comfortable with compilers (GCC/LLVM/Intel/OneAPI) and MPI stacks; Python + shell power user
Familiarity with network architectures (Omni-Path/OPA, InfiniBand, Ethernet/RDMA/RoCE) and Linux systems at the performance-tuning level, including NIC offloads, CQ moderation, pacing, ECN/RED
Excellent written and verbal communication—able to turn measurements into persuasive, SLO-driven narratives for inference
Preferred
M.S. in CS/EE/CE/Math or related
Scheduler expertise (SLURM, PBS) and multi-tenant cluster ops
Hands-on profiling & tracing of GPU/comm paths (Nsight Systems, Nsight Compute, ROCm tools/rocprof/roctracer/omnitrace, VTune, perf, PCP, eBPF)
Experience with NeMo, DeepSpeed, Megatron-LM, FSDP, and collective ops analysis (AllReduce/AllGather/ReduceScatter/Broadcast)
Background in HPC performance engineering or storage (BeeGFS, Lustre, NVMeoF) for data & checkpoint pipelines
Benefits
Medical, dental, and vision coverage
Disability and life insurance
Dependent care flexible spending account
Accidental injury insurance
Pet insurance
Generous paid holidays
401(k) with company match
Open Time Off (OTO)
Sick time
Bonding leave
Pregnancy disability leave
Company
Cornelis Networks
Cornelis Networks develops purpose-built fabrics for scientific, commercial, and government organizations.
H1B Sponsorship
Cornelis Networks has a track record of offering H1B sponsorship. Please note that this does not guarantee sponsorship for this specific role. Total sponsorships by year (data powered by US Department of Labor):
2025: 6
2024: 2
2023: 1
2022: 2
2021: 1
Funding
Current Stage: Growth Stage
Total Funding: $93.3M
Key Investors: IAG Capital Partners, Downing Ventures
2024-03-12: Series B, $25M
2023-08-24: Series Unknown, $19.3M
2022-11-14: Series B, $29M
Recent News
Inside HPC & AI News | High-Performance Computing & Artificial Intelligence, 2025-12-10
Inside HPC & AI News | High-Performance Computing & Artificial Intelligence, 2025-11-26
Company data provided by Crunchbase