LLM Training Frameworks and Optimization Engineer
Together AI is a research-driven artificial intelligence company focused on building innovative AI infrastructure. They are seeking an LLM Training Frameworks and Optimization Engineer to develop and optimize distributed training frameworks for large language models, ensuring robustness and efficiency in training pipelines.
Artificial Intelligence (AI) · Generative AI · Internet · IT Infrastructure · Open Source
Responsibilities
Design, implement, and optimize distributed training frameworks tailored for large language models
Develop custom modules, plugins, and features to enhance framework scalability and performance
Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training; a minimal illustration follows this list
Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training
Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks
Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators
Ensure training systems scale efficiently to thousands of nodes and petabytes of data
Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines
Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements
Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle
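As a rough, illustrative sketch of the gradient-synchronization and mixed-precision responsibilities above (not Together AI's actual stack), a minimal PyTorch DDP training step might look like the following; the toy model, shapes, and hyperparameters are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in model; a real LLM would be wrapped the same way.
    model = torch.nn.Linear(4096, 4096).cuda()
    # DDP registers hooks that all-reduce gradient buckets during backward,
    # overlapping communication with the remaining gradient computation.
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

    x = torch.randn(8, 4096, device=f"cuda:{local_rank}")  # placeholder batch
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = ddp_model(x).square().mean()  # placeholder loss

    scaler.scale(loss).backward()  # gradients are all-reduced during backward
    scaler.step(optimizer)
    scaler.update()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this would be launched with, e.g., `torchrun --nproc_per_node=8 train.py`, one process per GPU.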
Qualifications
Required
5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure
Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA)
Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism)
Familiarity with GPU/TPU hardware and deep learning performance optimizations
Proficient in Python and C++ or CUDA for high-performance computing
Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding); a checkpointing sketch follows this list
Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization
Analytical problem-solving skills and a focus on performance improvement
Strong collaboration and communication skills across teams
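To illustrate the activation-checkpointing qualification above (a sketch only, under assumed toy sizes and a made-up block architecture), the snippet below uses `torch.utils.checkpoint` to trade recomputation for activation memory:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy feed-forward residual block, for illustration only."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

class CheckpointedStack(torch.nn.Module):
    def __init__(self, depth: int = 12, dim: int = 1024):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Discard each block's intermediate activations in forward and
            # recompute them during backward, trading FLOPs for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedStack()
y = model(torch.randn(2, 16, 1024, requires_grad=True))
y.mean().backward()  # each block's forward reruns here to rebuild activations
```

The memory saved grows with depth and activation size, at the cost of roughly one extra forward pass per backward.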
Preferred
Familiarity with graph optimization and compiler-level performance tuning
Contributions to open-source deep learning or distributed training projects
Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels); see the sketch below
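As a hedged illustration of kernel fusion (one common entry point among several; the `bias_gelu` function is a hypothetical example, not from this role), `torch.compile` can fuse adjacent elementwise ops into a single generated kernel via its default Inductor backend:

```python
import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Two elementwise ops that eager execution runs as separate kernels,
    # each with its own round trip through global memory.
    return torch.nn.functional.gelu(x + bias)

# torch.compile traces the function and, on supported backends, fuses the
# add and GELU into one generated kernel, eliminating the intermediate tensor.
fused_bias_gelu = torch.compile(bias_gelu)

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = fused_bias_gelu(x, bias)
```

Hand-written custom CUDA kernels target the same bottleneck when compiler-generated fusion is insufficient.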
Benefits
Competitive compensation
Startup equity
Health insurance
Other competitive benefits
Company
Together AI
Together AI is a cloud platform for building open-source generative AI and the infrastructure for developing AI models.
H1B Sponsorship
Together AI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of job fields receiving sponsorship; the highlighted field is similar to this job's]
Trends of Total Sponsorships
2023: 3 · 2024: 6 · 2025: 19
Funding
Current Stage: Growth Stage
Total Funding: $533.5M
Key Investors: Salesforce Ventures, Lux Capital
Funding Rounds:
2025-02-20 · Series B · $305M
2024-03-13 · Series A · $106M
2023-11-29 · Series A · $102.5M
Company data provided by Crunchbase.