Bagel Labs · 7 hours ago

Member of Technical Staff (Infra)

Bagel Labs is a distributed machine learning research lab focused on open-source superintelligence. The lab is seeking a Member of Technical Staff to design and optimize infrastructure for training and serving large diffusion models, working at the intersection of systems and performance engineering.

Artificial Intelligence (AI) · Machine Learning
Hiring Manager
Bidhan R.

Responsibilities

Build and operate distributed training stacks for diffusion models (U-Net, DiT, video diffusion, world-model variants) across multi-node GPU clusters
Implement and tune parallelism strategies for training and inference, including data parallel, tensor parallel, pipeline parallel, ZeRO/FSDP-style sharding, expert parallel, and diffusion-specific tricks (timestep-level scheduling, CFG parallelism, microbatching); a minimal sharding sketch follows this list
Profile end-to-end GPU performance and remove bottlenecks across kernels, memory, comms, and I/O (CUDA graphs, kernel fusion, attention kernels, NCCL tuning, overlap of compute and comms)
Own inference serving for diffusion workloads with high throughput and predictable latency, including dynamic batching, variable resolution handling, caching, prefill/conditioning optimization, and multi-GPU execution
Design robust orchestration for heterogeneous and preemptible environments (on-prem, bare metal, cloud, spot), including checkpointing, resumability, and fault tolerance
Build observability that is actually useful for diffusion: step-time breakdowns, denoising throughput, VRAM headroom, NCCL health, queueing, tail latency, error budgets, and cost per sample
Implement pragmatic quantization and precision strategies for diffusion inference and training, balancing quality, speed, and stability (BF16/FP16/TF32/FP8, weight-only INT8/INT4 where it makes sense, selective quantization of submodules)
Improve developer velocity through reproducible environments, CI for performance regressions, and automation for cluster bring-up and rollouts
Write clear internal docs and occasional public technical deep-dives on blog.bagel.com when it helps the community and hiring
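
For illustration, a minimal sketch of the ZeRO/FSDP-style sharding and BF16 precision work this role involves, assuming PyTorch 2.x launched with torchrun; the toy model, dimensions, and hyperparameters are placeholders, not Bagel's actual stack:

```python
# Illustrative only: wrap a toy denoiser in FSDP (ZeRO-3-style full sharding)
# with BF16 mixed precision. Assumes launch via
#   torchrun --nproc_per_node=<gpus> train.py
# so RANK / WORLD_SIZE / LOCAL_RANK are set by the launcher.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy


def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder for a diffusion denoiser (U-Net / DiT); any nn.Module works here.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        device_id=local_rank,
    )

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).float().pow(2).mean()  # stand-in for a denoising loss
        loss.backward()
        optim.step()
        optim.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```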

Qualifications

GPU performance optimization · Distributed systems experience · Parallelism implementation · Linux fundamentals · CUDA tooling literacy · Deployment pipelines · Model code modification · Networking basics · Open-source contributions · Cost engineering · TensorRT experience

Required

Strong Linux fundamentals, networking basics, and the ability to debug production incidents without panic
Deep GPU performance instincts: profiling, memory behavior, kernel-level thinking, and practical CUDA tooling literacy (even if you are not writing CUDA daily); a profiling sketch follows this list
Hands-on experience scaling training and/or inference across multiple GPUs and nodes
Comfort implementing parallelism and sharding in modern frameworks (PyTorch, NCCL, torch.distributed, FSDP/ZeRO-style systems, or equivalent)
Experience building reliable deployment pipelines (containers, rollouts, versioning, rollback, secrets, config management)
The ability to read model code and change it when infrastructure and performance require it
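
As an example of the profiling habit described above, a small sketch using torch.profiler to break a training step into CPU and CUDA time, assuming a single GPU; the toy model and schedule numbers are arbitrary placeholders:

```python
# Illustrative only: per-step breakdown of GPU kernel time vs. CPU overhead
# for a toy forward/backward loop, using the built-in torch.profiler.
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 2048)
).cuda()
optim = torch.optim.SGD(model.parameters(), lr=1e-3)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3),  # skip warm-up steps
    record_shapes=True,
) as prof:
    for _ in range(8):
        x = torch.randn(64, 2048, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()
        optim.step()
        optim.zero_grad(set_to_none=True)
        prof.step()  # advance the profiler schedule once per training step

# Sort by GPU time to find the kernels worth optimizing first.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```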

Preferred

Contributions to open-source performance or distributed systems projects (PyTorch internals, Triton kernels, xFormers/FlashAttention, NCCL tooling, Ray, Kubernetes operators, etc.)
Experience with diffusion-specific serving and optimization (Diffusers, ComfyUI, custom schedulers/solvers, distillation, few-step generation, VAE decode optimization, tiled generation)
TensorRT or compiler experience (torch.compile/Inductor, XLA, CUDA graphs), and a habit of measuring instead of guessing (illustrated in the sketch after this list)
Experience building multi-tenant GPU platforms with isolation, fair scheduling, and predictable QoS
Comfort with cost engineering: understanding where dollars burn in GPU clusters and how to reduce spend without adding fragility
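
A sketch of "measuring instead of guessing" applied to torch.compile, assuming PyTorch 2.x on a single GPU; the block, shapes, and iteration counts are placeholders:

```python
# Illustrative only: time the same toy block eagerly and under torch.compile,
# using CUDA events so the numbers reflect GPU time rather than Python overhead.
import torch


def bench(fn, x, iters=50):
    for _ in range(5):  # warm-up; also triggers compilation for the compiled variant
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration


block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().eval()
x = torch.randn(32, 4096, device="cuda")

with torch.no_grad():
    eager_ms = bench(block, x)
    compiled_ms = bench(torch.compile(block), x)  # Inductor backend by default

print(f"eager: {eager_ms:.3f} ms/iter, compiled: {compiled_ms:.3f} ms/iter")
```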

Benefits

A deeply technical culture where bold frontier ideas are debated, stress-tested, and built.
High autonomy and direct ownership of critical systems.
In-person role at our Toronto office.
Work that can set the direction for decentralized AI.
Paid travel opportunities to the top ML conferences around the world.

Company

Bagel Labs

Open source superintelligence.

Funding

Current Stage
Early Stage
Total Funding
$3.1M
Key Investors
CoinFund
2024-01-23 · Pre Seed · $3.1M
Company data provided by Crunchbase