Baseten · 1 week ago
Senior Software Engineer - Model Training
Baseten is a company that powers mission-critical inference for dynamic AI companies by providing infrastructure and developer tooling. They are seeking a Senior Software Engineer – Model Training to build infrastructure for large-scale training of foundation models, optimize GPU utilization, and collaborate with cross-functional teams to meet customer needs.
AI Infrastructure · Artificial Intelligence (AI) · Developer Tools · Machine Learning · Software · Software Engineering
Responsibilities
Design, build, and maintain distributed training infrastructure for large-scale foundation models
Implement scalable pipelines for fine-tuning and training across heterogeneous GPU/accelerator clusters
Optimize training performance through techniques like FSDP, DDP, ZeRO, and mixed precision training
Contribute to frameworks and tooling that make training workflows efficient, reproducible, and developer-friendly
Collaborate with cross-functional teams (Product, Forward Deployed Engineering, Inference Infra) to ensure training systems meet real-world requirements
Research and apply emerging techniques in training efficiency and model adaptation, and productionize them in the Baseten platform
Participate in code reviews, system design discussions, and technical deep dives to maintain a high engineering bar
Qualifications
Required
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience
5+ years of experience in ML infrastructure, distributed systems, or ML platform engineering, including 2+ years in a tech lead or manager role
Strong expertise in distributed training frameworks and orchestration (FSDP, DDP, ZeRO, Ray, Kubernetes, Slurm, or similar)
Hands-on experience building or scaling training infrastructure for LLMs or other foundation models
Deep understanding of GPU/accelerator hardware utilization, mixed precision training, and scaling efficiency
Proven ability to lead and mentor technical teams while delivering complex infrastructure projects
Excellent communication skills, with the ability to bridge technical depth and business needs
Preferred
Experience building APIs, SDKs, or developer tools for ML workflows
Familiarity with cluster management and scheduling (Kubernetes, Ray, Slurm, etc.)
Knowledge of parameter-efficient fine-tuning methods (LoRA, QLoRA) and evaluation pipelines
Contributions to open-source distributed training or ML infra projects
Experience with cloud environments (AWS, GCP, Azure) and container orchestration
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employee and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Company
Baseten
Baseten is an AI infrastructure company that integrates machine learning into business operations, production, and processes.
H1B Sponsorship
Baseten has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship. Legend: job field similar to this job]
Trends of Total Sponsorships
2025 (6)
2024 (8)
2023 (1)
2020 (1)
Funding
Current Stage: Late Stage
Total Funding: $585M
Key Investors: Bond, Greylock
2026-01-20 · Series Unknown · $300M
2025-09-05 · Series D · $150M
2025-02-19 · Series C · $75M
Company data provided by Crunchbase