fal · 2 months ago

Staff Technical Lead for Inference & ML Performance

San Francisco

Full-time

Onsite

Lead/Staff

fal is pioneering the next generation of generative-media infrastructure and is seeking a Staff Technical Lead for Inference & ML Performance. The role involves guiding a team that builds and optimizes high-performance inference solutions so that generative models achieve best-in-class performance.

AI Infrastructure · Artificial Intelligence (AI) · Developer Platform · Information Technology · Machine Learning

Responsibilities

Set technical direction. Guide your team (kernels, applied performance, ML compilers, distributed inference) to build high-performance inference solutions

Hands-on IC leadership. Personally contribute to critical inference performance enhancements and optimizations

Collaborate closely with research & applied ML teams. Influence model inference strategies and deployment techniques

Drive advanced performance optimizations. Implement model parallelism, kernel optimization, and compiler strategies

Mentor and scale your team. Coach and expand your team of performance-focused engineers

Qualifications

ML performance optimization · Inference techniques · PyTorch · TensorRT · Model parallelism · Kernel optimization · Compiler strategies · Cross-functional collaboration · Leadership experience

Required

Deeply experienced in ML performance optimization. You've optimized inference for large-scale generative models in production environments

Understand the full ML performance stack. From PyTorch, TensorRT, TransformerEngine, and Triton down to CUTLASS kernels, you've navigated and optimized them all

Know inference inside out. Expert-level familiarity with advanced inference techniques: quantization, kernel authoring, compilation, model parallelism (tensor parallel (TP), context/sequence parallel, expert parallel), distributed serving, and profiling

Lead from the front. You're a respected IC who enjoys getting hands-on with the toughest problems, demonstrating excellence to inspire your team

Thrive in cross-functional collaboration. Comfortable interfacing closely with applied ML teams, researchers, and stakeholders