Embedding VC · 1 month ago
Member of Technical Staff - ML Infrastructure & Performance
Moonlake builds AI for creating real-time interactive content and is seeking a Member of Technical Staff to improve the throughput, latency, and cost of its models. The role involves optimizing GPU performance, managing serving stacks, and ensuring system scalability.
Artificial Intelligence (AI) · Impact Investing
Responsibilities
GPU performance: CUDA/Triton kernels, FlashAttention family, paged attention, CUDA Graphs
Serving stack: TensorRT-LLM/Triton Inference Server, vLLM/TGI; continuous batching; on-GPU KV reuse; speculative decoding/Medusa; mixture-of-agents routing
Parallelism: FSDP/ZeRO, TP/PP/expert parallel; NCCL tuning
Quantization/PEFT: AWQ/GPTQ/FP8; LoRA/DoRA serving
Systems: Ray/k8s/Argo, observability (Prom/Grafana/OpenTelemetry), autoscaling, A/B infra, canary + rollback
Qualifications
Required
Experience with GPU performance: CUDA/Triton kernels, FlashAttention family, paged attention, CUDA Graphs
Experience with serving stack: TensorRT-LLM/Triton Inference Server, vLLM/TGI; continuous batching; on-GPU KV reuse; speculative decoding/Medusa; mixture-of-agents routing
Experience with parallelism: FSDP/ZeRO, TP/PP/expert parallel; NCCL tuning
Experience with quantization/PEFT: AWQ/GPTQ/FP8; LoRA/DoRA serving
Experience with systems: Ray/k8s/Argo, observability (Prom/Grafana/OpenTelemetry), autoscaling, A/B infra, canary + rollback
Previous experience at infra-heavy startups such as Databricks or Roblox