P-1 AI · 9 hours ago

Machine Learning Engineer - Training & Infrastructure

San Francisco, CA

Full-time

Onsite

Mid Level

3+ years exp

P-1 AI is building engineering AGI to transform the built world. The company is seeking a Machine Learning Engineer to manage large-scale LLM training operations, keep training on GPU clusters efficient and reliable, and collaborate with researchers and ML engineers on model development.

Artificial Intelligence (AI) · Software · Web Development

H1B Sponsor Likely

Responsibilities

Own the training pipeline for large-scale LLM fine-tuning and post-training workflows

Configure, launch, monitor, and debug multi-node distributed training jobs using FSDP, DeepSpeed, or custom wrappers

Contribute to upstream and internal forks of training frameworks like TorchTune, TRL, and Hugging Face Transformers

Tune training parameters, memory footprints, and sharding strategies for optimal throughput

Work closely with infra and systems teams to maintain the health and utilization of our GPU clusters (e.g., InfiniBand, NCCL, Slurm, Kubernetes)

Implement features or fixes to unblock novel use cases in our LLM training stack

Qualifications

Large-scale ML systems · PyTorch · Multi-node GPU training · FSDP · DeepSpeed · NCCL · CUDA memory · Kubernetes · Slurm · Reinforcement learning

Required

3+ years working with large-scale ML systems or training pipelines

Deep familiarity with PyTorch, especially distributed training via FSDP, DeepSpeed, or DDP

Comfortable navigating training libraries like TorchTune, Accelerate, or Trainer APIs

Practical experience with multi-node GPU training, including profiling, debugging, and optimizing jobs

Understanding of low-level components like NCCL, InfiniBand, CUDA memory, and model partitioning strategies

You enjoy bridging research and engineering—making messy ideas actually run on hardware

Preferred

Experience maintaining Slurm, Ray, or Kubernetes clusters

Past contributions to open-source ML training frameworks

Exposure to model scaling laws, checkpointing formats (e.g., HF sharded safetensors vs. distcp), or mixed precision training

Familiarity with on-policy reinforcement learning setups with inference (policy rollouts) as part of the training loop, such as GRPO, PPO, or A2C

Experience working at a startup

Company

P-1 AI

P-1 AI is a technology company focused on developing an artificial general engineering intelligence (AGEI).

Founded in 2024

Henderson, Nevada, USA

2-10 employees

https://p-1.ai

H1B Sponsorship

P-1 AI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship (chart not available)

Trends of Total Sponsorships

2025 (1)

Funding

Current Stage

Early Stage

Total Funding

$23M

Key Investors

Radical Ventures, Village Global

2025-04-28 · Seed · $23M

2024-07-30 · Pre-Seed

Leadership Team

Paul Eremenko

Co-Founder & CEO

Company data provided by Crunchbase