ML Infra Engineer (TPU/Jax/Optimization) jobs in United States

Physical Intelligence · 2 weeks ago

ML Infra Engineer (TPU/Jax/Optimization)

Physical Intelligence is focused on advancing physical intelligence through machine learning and scalable infrastructure. In this role, you will help scale and optimize training systems and core model code, owning critical infrastructure for large-scale training and collaborating closely with researchers and model engineers.
Artificial Intelligence (AI) · Machine Learning · Robotics
H1B Sponsor Likely

Responsibilities

Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging
Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction
Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization
Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments
Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost
Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale
Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics

Qualifications

JAX · GPU/TPU management · Distributed training · PyTorch · Cloud platforms · Software engineering · Cross-functional communication · Ownership mindset

Required

Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms
Hands-on large-scale training experience in JAX (preferred) or PyTorch
Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines
Experience managing training workloads on cloud platforms and cluster schedulers (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS)
Ability to debug and optimize performance bottlenecks across the training stack
Strong cross-functional communication and ownership mindset

Preferred

Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels)
Experience operating close to hardware (GPU/TPU performance tuning)
Background in robotics, multimodal models, or large-scale foundation models
Experience designing abstractions that balance researcher flexibility with system reliability

Company

Physical Intelligence

Physical Intelligence is an AI company developing machine learning for robots and other physical devices.

H1B Sponsorship

Physical Intelligence has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (4)
2024 (1)

Funding

Current Stage
Growth Stage
Total Funding
$1.07B
Key Investors
CapitalG, Jeff Bezos, Lux Capital, Thrive Capital
2025-11-20 · Series B · $600M
2024-11-04 · Series A · $400M
2024-03-12 · Seed · $70M

Leadership Team

Lachy Groom
Co-Founder
Company data provided by Crunchbase