Senior Software Engineer - Model Training jobs in United States

Baseten · 2 hours ago

Senior Software Engineer - Model Training

Baseten powers mission-critical inference for the world's most dynamic AI companies. The company is seeking a Senior Software Engineer – Model Training to build the infrastructure for large-scale training of foundation models. The role involves designing distributed training systems, optimizing GPU utilization, and collaborating with other teams to meet customer needs.

Artificial Intelligence (AI), Developer Tools, Machine Learning, Software, Software Engineering
H1B Sponsor Likely

Responsibilities

Design, build, and maintain distributed training infrastructure for large-scale foundation models
Implement scalable pipelines for fine-tuning and training across heterogeneous GPU/accelerator clusters
Optimize training performance through techniques like FSDP, DDP, ZeRO, and mixed precision training
Contribute to frameworks and tooling that make training workflows efficient, reproducible, and developer-friendly
Collaborate with cross-functional teams (Product, Forward Deployed Engineering, Inference Infra) to ensure training systems meet real-world requirements
Research and apply emerging techniques in training efficiency and model adaptation, and productionize them in the Baseten platform
Participate in code reviews, system design discussions, and technical deep dives to maintain a high engineering bar

Qualifications

Distributed training frameworks, ML infrastructure experience, GPU utilization optimization, Kubernetes, Cloud environments, Communication, Team leadership

Required

Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience
5+ years of experience in ML infrastructure, distributed systems, or ML platform engineering, including 2+ years in a tech lead or manager role
Strong expertise in distributed training frameworks and orchestration (FSDP, DDP, ZeRO, Ray, Kubernetes, Slurm, or similar)
Hands-on experience building or scaling training infrastructure for LLMs or other foundation models
Deep understanding of GPU/accelerator hardware utilization, mixed precision training, and scaling efficiency
Proven ability to lead and mentor technical teams while delivering complex infrastructure projects
Excellent communication skills, with the ability to bridge technical depth and business needs

Preferred

Experience building APIs, SDKs, or developer tools for ML workflows
Familiarity with cluster management and scheduling (Kubernetes, Ray, Slurm, etc.)
Knowledge of parameter-efficient fine-tuning methods (LoRA, QLoRA) and evaluation pipelines
Contributions to open-source distributed training or ML infra projects
Experience with cloud environments (AWS, GCP, Azure) and container orchestration

Benefits

Competitive compensation, including meaningful equity
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities

Company

Baseten

Baseten is an AI infrastructure company that integrates machine learning into business operations, production, and processes.

H1B Sponsorship

Baseten has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (6)
2024 (8)
2023 (1)
2020 (1)

Funding

Current Stage: Late Stage
Total Funding: $285M
Key Investors: Bond, Greylock
2025-09-05 · Series D · $150M
2025-02-19 · Series C · $75M
2024-03-04 · Series B · $40M

Leadership Team

Aaron Relph, Design
Company data provided by Crunchbase