
Together AI · 4 days ago

LLM Training Frameworks and Optimization Engineer

Together AI is a research-driven artificial intelligence company focused on building innovative AI infrastructure. They are seeking an LLM Training Frameworks and Optimization Engineer to develop and optimize distributed training frameworks for large language models, ensuring robustness and efficiency in training pipelines.

Artificial Intelligence (AI) · Generative AI · Internet · IT Infrastructure · Open Source

Responsibilities

Design, implement, and optimize distributed training frameworks tailored for large language models
Develop custom modules, plugins, and features to enhance framework scalability and performance
Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training (see the first sketch after this list)
Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training (see the mixed-precision sketch after this list)
Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks
Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators
Ensure training systems scale efficiently to thousands of nodes and petabytes of data
Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines (see the checkpointing sketch after this list)
Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements
Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle
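
For illustration, a minimal sketch of explicit gradient synchronization via all-reduce, assuming PyTorch's torch.distributed with an already-initialized process group (production frameworks such as DDP do this automatically and overlap the communication with backward compute):

import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module):
    # Sum each gradient across all ranks, then divide by the world size
    # so every rank ends up with the averaged gradient.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)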
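
Similarly, a hedged sketch of one mixed-precision training step using torch.cuda.amp (model, optimizer, batch, targets, and loss_fn are placeholder names, not part of the posting):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def train_step(model, optimizer, batch, targets, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                       # forward pass in reduced precision
        loss = loss_fn(model(batch), targets)
    scaler.scale(loss).backward()          # scale loss to avoid fp16 underflow
    scaler.step(optimizer)                 # unscales gradients, then steps
    scaler.update()                        # adapts the loss scale over time
    return loss.detach()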
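
For the checkpointing item, the simplest form of such a resilience mechanism is periodically persisting training state so a preempted job can resume; a sketch assuming PyTorch (the path and resume logic are illustrative only):

import os
import torch

CKPT_PATH = "checkpoint.pt"  # illustrative path

def save_checkpoint(step, model, optimizer):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # no checkpoint yet: start from step 0
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # resume after the last saved step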

Qualifications

Distributed training frameworks · Parallelism techniques · Deep learning performance optimizations · Python · C++ · CUDA · Memory optimization techniques · Training dynamics for LLMs · Analytical problem-solving · Collaboration · Communication

Required

5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure
Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA)
Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism)
Familiarity with GPU/TPU hardware and deep learning performance optimizations
Proficient in Python and C++ or CUDA for high-performance computing
Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding; a sketch follows this list)
Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization
Analytical problem-solving skills and a focus on performance improvement
Strong collaboration and communication skills across teams
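
For reference, the activation checkpointing mentioned above trades compute for memory: activations are discarded during the forward pass and recomputed during backward. A minimal sketch using torch.utils.checkpoint (the block structure here is hypothetical):

import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class Model(torch.nn.Module):
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Each block's intermediate activations are recomputed during
            # backward instead of being stored, cutting peak memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x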

Preferred

Familiarity with graph optimization and compiler-level performance tuning
Contributions to open-source deep learning or distributed training projects
Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels; see the sketch below)
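
As a point of reference for the kernel-fusion item, compilers such as torch.compile can fuse chains of pointwise operations into a single GPU kernel, avoiding extra round trips to memory; a sketch assuming PyTorch 2.x:

import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # A chain of pointwise ops that a fusing compiler can emit as one kernel.
    y = x + bias
    return 0.5 * y * (1.0 + torch.erf(y * 0.7071067811865476))

fused_bias_gelu = torch.compile(bias_gelu)  # eligible for kernel fusion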

Benefits

Competitive compensation
Startup equity
Health insurance
Other competitive benefits

Company

Together AI

Together AI is a cloud-based platform for building open-source generative AI and the infrastructure for developing AI models.

H1B Sponsorship

Together AI has a track record of sponsoring H1B visas. Note that this does not guarantee sponsorship for this specific role. The information below is provided for reference. (Data powered by the US Department of Labor.)
[Chart: distribution of job fields receiving sponsorship, highlighting fields similar to this job]
Total sponsorships by year: 2025 (19), 2024 (6), 2023 (3)

Funding

Current Stage: Growth Stage
Total Funding: $533.5M
Key Investors: Salesforce Ventures, Lux Capital
2025-02-20 · Series B · $305M
2024-03-13 · Series A · $106M
2023-11-29 · Series A · $102.5M

Leadership Team

Vipul Ved Prakash
Co-Founder & CEO

Kae Ike Lim
Executive Assistant to Co-Founder and CEO
Company data provided by Crunchbase.