Baseten · 2 hours ago
Senior Software Engineer - Model Training
Baseten powers mission-critical inference for the world's most dynamic AI companies, and they are seeking a Senior Software Engineer – Model Training to build the infrastructure for large-scale training of foundation models. The role involves designing distributed training systems, optimizing GPU utilization, and collaborating with teams to meet customer needs.
Artificial Intelligence (AI) · Developer Tools · Machine Learning · Software · Software Engineering
Responsibilities
Design, build, and maintain distributed training infrastructure for large-scale foundation models
Implement scalable pipelines for fine-tuning and training across heterogeneous GPU/accelerator clusters
Optimize training performance through techniques like FSDP, DDP, ZeRO, and mixed precision training
Contribute to frameworks and tooling that make training workflows efficient, reproducible, and developer-friendly
Collaborate with cross-functional teams (Product, Forward Deployed Engineering, Inference Infra) to ensure training systems meet real-world requirements
Research and apply emerging techniques in training efficiency and model adaptation, and productionize them in the Baseten platform
Participate in code reviews, system design discussions, and technical deep dives to maintain a high engineering bar
Qualifications
Required
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience
5+ years of experience in ML infrastructure, distributed systems, or ML platform engineering, including 2+ years in a tech lead or manager role
Strong expertise in distributed training frameworks and orchestration (FSDP, DDP, ZeRO, Ray, Kubernetes, Slurm, or similar)
Hands-on experience building or scaling training infrastructure for LLMs or other foundation models
Deep understanding of GPU/accelerator hardware utilization, mixed precision training, and scaling efficiency
Proven ability to lead and mentor technical teams while delivering complex infrastructure projects
Excellent communication skills, with the ability to bridge technical depth and business needs
Preferred
Experience building APIs, SDKs, or developer tools for ML workflows
Familiarity with cluster management and scheduling (Kubernetes, Ray, Slurm, etc.)
Knowledge of parameter-efficient fine-tuning methods (LoRA, QLoRA) and evaluation pipelines
Contributions to open-source distributed training or ML infra projects
Experience with cloud environments (AWS, GCP, Azure) and container orchestration
Benefits
Competitive compensation, including meaningful equity
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities
Company
Baseten
Baseten is an AI infrastructure company that integrates machine learning into business operations, production, and processes.
H1B Sponsorship
Baseten has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart)
Trends of Total Sponsorships
2025 (6)
2024 (8)
2023 (1)
2020 (1)
Funding
Current Stage: Late Stage
Total Funding: $285M
Key Investors: Bond, Greylock
2025-09-05 · Series D · $150M
2025-02-19 · Series C · $75M
2024-03-04 · Series B · $40M
Company data provided by Crunchbase