Foundation · 1 day ago

Vision Language Action (VLA) Models Engineer

Foundation is developing the future of general-purpose robotics to address the labor shortage. The company is seeking a Vision Language Action (VLA) Models Engineer to develop and optimize vision-language-action models and integrate them with real-time robot control stacks.

Artificial Intelligence (AI) · Machine Learning · Robotics
H1B Sponsor Likely

Hiring Manager: Jordi Vidal

Responsibilities

Develop and optimize vision-language-action models, including transformers, diffusion models, and multimodal encoders/decoders
Build representations for 2D/3D perception, affordances, scene understanding, and spatial reasoning
Integrate LLM-based reasoning with action planning and control policies
Design datasets for multimodal learning: video-action trajectories, instruction following, teleoperation data, and synthetic data
Interface VLA model outputs with real-time robot control stacks (navigation, manipulation, locomotion)
Implement grounding layers that convert natural language instructions into symbolic, geometric, or skill-level action plans (a minimal grounding sketch follows this list)
Deploy models on on-board or edge compute platforms, optimizing for latency, safety, and reliability
Build scalable pipelines for ingesting, labeling, and generating multimodal training data
Create simulation-to-real (Sim2Real) training workflows using synthetic environments and teleoperated demonstration data
Optimize training pipelines, model parallelism, and evaluation frameworks
Work closely with robotics, hardware, controls, and safety teams to ensure model outputs are executable, safe, and predictable
Collaborate with product teams to define robot capabilities and user-facing behaviors
Participate in user and field testing to iterate on real-world performance
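
As a rough illustration of the grounding-layer item above, here is a minimal sketch in Python. Everything in it is assumed for illustration: the skill names, the JSON plan schema the LLM is prompted to emit, and the SkillCall structure are hypothetical, not part of Foundation's stack.

    import json
    from dataclasses import dataclass, field

    @dataclass
    class SkillCall:
        # Hypothetical skill-level action: a named robot skill plus arguments.
        name: str
        args: dict = field(default_factory=dict)

    # Illustrative skill registry; a real stack would map these to control policies.
    KNOWN_SKILLS = {"navigate_to", "pick", "place"}

    def ground_instruction(llm_output: str) -> list[SkillCall]:
        # Parse the LLM's JSON plan and reject steps naming unknown skills,
        # so only validated, executable actions reach the controller.
        plan = json.loads(llm_output)
        calls = []
        for step in plan:
            if step["skill"] not in KNOWN_SKILLS:
                raise ValueError(f"unknown skill: {step['skill']}")
            calls.append(SkillCall(step["skill"], step.get("args", {})))
        return calls

    if __name__ == "__main__":
        raw = ('[{"skill": "navigate_to", "args": {"x": 1.0, "y": 2.5}},'
               ' {"skill": "pick", "args": {"object": "red cup"}}]')
        for call in ground_instruction(raw):
            print(call)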

Qualifications

Multimodal models · PyTorch · Training pipelines · Vision transformers · GPU acceleration · Dataset creation · Robotics simulation · Python · Embedded hardware · Reinforcement learning · MSc · PhD

Required

Strong experience training multimodal models, including VLAs, VLMs, vision transformers, and LLMs
Ability to build and iterate on large-scale training pipelines
Deep proficiency in PyTorch or JAX, distributed training, and GPU acceleration (see the training sketch after this list)
Strong software engineering skills in Python and modern ML tooling
Experience with (synthetic) dataset creation and curation
Understanding of real-time deployment constraints on embedded hardware
MSc or PhD in Computer Science, Robotics, Machine Learning, or related field—or equivalent industry experience
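
As a reference point for the distributed-training requirement, below is a minimal PyTorch DistributedDataParallel (DDP) loop. It is a sketch under stated assumptions: the linear layer and random batches stand in for a real multimodal model and dataset, and it assumes launch via torchrun (e.g. torchrun --nproc_per_node=2 train.py) on NVIDIA GPUs with NCCL.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
        dist.init_process_group("nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model; a real job would wrap a multimodal VLA backbone.
        model = DDP(torch.nn.Linear(512, 256).cuda(), device_ids=[local_rank])
        opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

        for step in range(100):
            x = torch.randn(32, 512, device="cuda")  # stand-in batch
            loss = model(x).pow(2).mean()            # stand-in loss
            opt.zero_grad()
            loss.backward()  # DDP all-reduces gradients across ranks here
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()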

Preferred

Familiarity with robotics simulation environments such as Isaac Lab, MuJoCo, or similar (a minimal example follows this list)
Hands-on experience with robotics, embodied AI, or reinforcement/imitation learning
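
For the simulation item above, a minimal MuJoCo example using the official mujoco Python bindings: it builds a toy one-box scene from an inline XML string and steps the physics. The scene is an illustrative assumption, not a Foundation environment.

    import mujoco

    WORLD_XML = """
    <mujoco>
      <worldbody>
        <body name="box" pos="0 0 1">
          <freejoint/>
          <geom type="box" size="0.1 0.1 0.1"/>
        </body>
      </worldbody>
    </mujoco>
    """

    # Build the scene and advance the physics for two simulated seconds
    # (1000 steps at the default 2 ms timestep).
    model = mujoco.MjModel.from_xml_string(WORLD_XML)
    data = mujoco.MjData(model)
    for _ in range(1000):
        mujoco.mj_step(model, data)

    # qpos for a free joint is [x, y, z, qw, qx, qy, qz]; index 2 is height.
    print(f"box height after 1000 steps: {data.qpos[2]:.3f}")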

Benefits

Health
Vision
Dental
401(k)

Company

Foundation

Foundation is developing the future of general-purpose robotics to address the labor shortage.

H1B Sponsorship

Foundation has a track record of offering H1B sponsorship. Note that this does not guarantee sponsorship for this specific role; the information below is provided for reference. (Data powered by the US Department of Labor)
[Charts omitted: distribution of sponsored job fields, with fields similar to this job highlighted, and total sponsorships by year; 2025: 1 sponsorship]

Funding

Current Stage: Growth Stage
Total Funding: unknown
2024-08-22: Pre Seed
Company data provided by Crunchbase