Bagel Labs · 7 hours ago

Member of Technical Staff (Infra)

Bagel Labs is a distributed machine learning research lab focused on open-source superintelligence. The lab is seeking a Member of Technical Staff to design and optimize infrastructure for training and serving large diffusion models, working at the intersection of systems and performance engineering.

Artificial Intelligence (AI) · Machine Learning
Hiring Manager
Bidhan R.

Responsibilities

Build and operate distributed training stacks for diffusion models (U-Net, DiT, video diffusion, world-model variants) across multi-node GPU clusters
Implement and tune parallelism strategies for training and inference, including data parallel, tensor parallel, pipeline parallel, ZeRO/FSDP-style sharding, expert parallel, and diffusion-specific tricks (timestep-level scheduling, CFG parallelism, microbatching); a minimal sharding sketch follows this list
Profile end-to-end GPU performance and remove bottlenecks across kernels, memory, comms, and I/O (CUDA graphs, kernel fusion, attention kernels, NCCL tuning, overlap of compute and comms)
Own inference serving for diffusion workloads with high throughput and predictable latency, including dynamic batching, variable resolution handling, caching, prefill/conditioning optimization, and multi-GPU execution
Design robust orchestration for heterogeneous and preemptible environments (on-prem, bare metal, cloud, spot), including checkpointing, resumability, and fault tolerance
Build observability that is actually useful for diffusion: step-time breakdowns, denoising throughput, VRAM headroom, NCCL health, queueing, tail latency, error budgets, and cost per sample
Implement pragmatic quantization and precision strategies for diffusion inference and training, balancing quality, speed, and stability (BF16/FP16/TF32/FP8, weight-only INT8/INT4 where it makes sense, selective quantization of submodules)
Improve developer velocity through reproducible environments, CI for performance regressions, and automation for cluster bring-up and rollouts
Write clear internal docs and occasional public technical deep-dives on blog.bagel.com when it helps the community and hiring
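
For illustration, a minimal sketch of the ZeRO/FSDP-style sharding and BF16 precision work this role involves, assuming PyTorch 2.x launched with torchrun; the toy model, dimensions, and hyperparameters are placeholders, not Bagel's actual stack:

```python
# Illustrative only: wrap a toy denoiser in FSDP (ZeRO-3-style full sharding)
# with BF16 mixed precision. Assumes launch via
#   torchrun --nproc_per_node=<gpus> train.py
# so RANK / WORLD_SIZE / LOCAL_RANK are set by the launcher.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy


def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder for a diffusion denoiser (U-Net / DiT); any nn.Module works here.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        device_id=local_rank,
    )

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).float().pow(2).mean()  # stand-in for a denoising loss
        loss.backward()
        optim.step()
        optim.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```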

Qualifications

GPU performance optimization · Distributed systems experience · Parallelism implementation · Linux fundamentals · CUDA tooling literacy · Deployment pipelines · Model code modification · Networking basics · Open-source contributions · Cost engineering · TensorRT experience

Required

Strong Linux fundamentals, networking basics, and the ability to debug production incidents without panic
Deep GPU performance instincts: profiling, memory behavior, kernel-level thinking, and practical CUDA tooling literacy (even if you are not writing CUDA daily); a profiling sketch follows this list
Hands-on experience scaling training and/or inference across multiple GPUs and nodes
Comfort implementing parallelism and sharding in modern frameworks (PyTorch, NCCL, torch.distributed, FSDP/ZeRO-style systems, or equivalent)
Experience building reliable deployment pipelines (containers, rollouts, versioning, rollback, secrets, config management)
The ability to read model code and change it when infrastructure and performance require it
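
As an example of the profiling habit described above, a small sketch using torch.profiler to break a training step into CPU and CUDA time, assuming a single GPU; the toy model and schedule numbers are arbitrary placeholders:

```python
# Illustrative only: per-step breakdown of GPU kernel time vs. CPU overhead
# for a toy forward/backward loop, using the built-in torch.profiler.
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 2048)
).cuda()
optim = torch.optim.SGD(model.parameters(), lr=1e-3)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3),  # skip warm-up steps
    record_shapes=True,
) as prof:
    for _ in range(8):
        x = torch.randn(64, 2048, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()
        optim.step()
        optim.zero_grad(set_to_none=True)
        prof.step()  # advance the profiler schedule once per training step

# Sort by GPU time to find the kernels worth optimizing first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```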

Preferred

Contributions to open-source performance or distributed systems projects (PyTorch internals, Triton kernels, xFormers/FlashAttention, NCCL tooling, Ray, Kubernetes operators, etc.)
Experience with diffusion-specific serving and optimization (Diffusers, ComfyUI, custom schedulers/solvers, distillation, few-step generation, VAE decode optimization, tiled generation)
TensorRT or compiler experience (torch.compile/Inductor, XLA, CUDA graphs), and a habit of measuring instead of guessing (illustrated in the sketch after this list)
Experience building multi-tenant GPU platforms with isolation, fair scheduling, and predictable QoS
Comfort with cost engineering: understanding where dollars burn in GPU clusters and how to reduce spend without adding fragility
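
A sketch of "measuring instead of guessing" applied to torch.compile, assuming PyTorch 2.x on a single GPU; the block, shapes, and iteration counts are placeholders:

```python
# Illustrative only: time the same toy block eagerly and under torch.compile,
# using CUDA events so the numbers reflect GPU time rather than Python overhead.
import torch


def bench(fn, x, iters=50):
    for _ in range(5):  # warm-up; also triggers compilation for the compiled variant
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration


block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().eval()
x = torch.randn(32, 4096, device="cuda")

with torch.no_grad():
    eager_ms = bench(block, x)
    compiled_ms = bench(torch.compile(block), x)  # Inductor backend by default

print(f"eager: {eager_ms:.3f} ms/iter, compiled: {compiled_ms:.3f} ms/iter")
```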

Benefits

A deeply technical culture where bold frontier ideas are debated, stress-tested, and built.
High autonomy and direct ownership of critical systems.
In-person role at our Toronto office.
Work that can set the direction for decentralized AI.
Paid travel opportunities to the top ML conferences around the world.

Company

Bagel Labs

Open source superintelligence.

Funding

Current Stage
Early Stage
Total Funding
$3.1M
Key Investors
CoinFund
2024-01-23 · Pre Seed · $3.1M
Company data provided by Crunchbase