Stability AI · 1 week ago

Research Scientist – VLM Generalist

United States

Full-time

Remote

Senior Level

Stability AI is seeking a Research Scientist with deep expertise in training and fine-tuning large Vision-Language and Language Models (VLMs / LLMs). The role involves designing and fine-tuning models for multimodal tasks, building training pipelines, and collaborating across teams to bring models from prototype to production.

Artificial Intelligence (AI) · Generative AI · Image Recognition · Information Technology · Software

Responsibilities

Design and fine-tune large-scale VLMs / LLMs — and hybrid architectures — for tasks such as visual reasoning, retrieval, 3D understanding, and embodied interaction

Build robust, efficient training and evaluation pipelines (data curation, distributed training, mixed precision, scalable fine-tuning)

Conduct in-depth analysis of model performance: ablations, bias / robustness checks, and generalisation studies

Collaborate across research, engineering, and 3D / graphics teams to bring models from prototype to production

Publish impactful research and help establish best practices for multimodal model adaptation

Qualifications

VLMs / LLMs · Machine Learning · Computer Vision · PyTorch · Multimodal alignment · 3D scene understanding · Distributed training · Communication skills · Collaborative mindset

Required

PhD (or equivalent experience) in Machine Learning, Computer Vision, NLP, Robotics, or Computer Graphics

Proven track record in fine-tuning or training large-scale VLMs / LLMs for real-world downstream tasks

Strong engineering mindset — you can design, debug, and scale training systems end-to-end

Deep understanding of multimodal alignment and representation learning (vision–language fusion, CLIP-style pre-training, retrieval-augmented generation)

Familiarity with recent trends, including video-language and long-context VLMs, spatio-temporal grounding, agentic multimodal reasoning, and Mixture-of-Experts (MoE) fine-tuning

Knowledge of 3D-aware multimodal models that use NeRFs, Gaussian splatting, or differentiable renderers for grounded reasoning and 3D scene understanding

Hands-on experience with PyTorch / DeepSpeed / Ray and distributed or mixed-precision training

Excellent communication skills and a collaborative mindset

Preferred

Experience integrating 3D and graphics pipelines into training workflows (e.g., mesh or point-cloud encoding, differentiable rendering, 3D VLMs)

Research or implementation experience with vision-language-action models, world-model-style architectures, or multimodal agents that perceive and act

Familiarity with efficient adaptation methods (LoRA, QLoRA, adapters, and other parameter-efficient fine-tuning techniques) and distillation for edge deployment

Knowledge of video and 4D generation trends, latent diffusion / rectified flow methods, or multimodal retrieval and reasoning pipelines

Background in GPU optimisation, quantisation, or model compression for real-time inference

Open-source or publication track record in top-tier ML / CV / NLP venues

Company

Stability AI

Stability AI is an artificial intelligence company focused on developing open-source generative AI models.

Founded in 2019

London, England, GBR

51-200 employees

https://stability.ai

Funding

Current Stage

Growth Stage

Total Funding

$256M

Key Investors

WPP · Intel

2025-03-05 · Corporate Round

2024-06-25 · Series Unknown · $80M

2023-11-09 · Convertible Note · $50M

Leadership Team

Prem Akkaraju

CEO

Hanno Basse

Chief Technology Officer

Recent News

TechWire Asia

Lenovo outlines a hybrid AI approach at CES 2026

2026-01-09

IT News Africa

Introducing Lenovo and Motorola Qira, a Personal Ambient Intelligence Designed to Work Across Devices

2026-01-07

Media Nama

Musician Files Lawsuit Against Stability AI For Using His Music For Training Despite Opt-Out Requests

2026-01-06

Company data provided by Crunchbase