LLM Training Frameworks and Optimization Engineer
Together AI is a research-driven artificial intelligence company focused on building innovative AI infrastructure. They are seeking an LLM Training Frameworks and Optimization Engineer to develop and optimize distributed training frameworks for large language models, ensuring robustness and efficiency in training pipelines.
Artificial Intelligence (AI) · Generative AI · Internet · IT Infrastructure · Open Source
Responsibilities
Design, implement, and optimize distributed training frameworks tailored for large language models
Develop custom modules, plugins, and features to enhance framework scalability and performance
Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training; a minimal illustration follows this list
Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training
Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks
Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators
Ensure training systems scale efficiently to thousands of nodes and petabytes of data
Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines
Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements
Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle
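As a rough, illustrative sketch of the gradient-synchronization and mixed-precision responsibilities above (not Together AI's actual stack), a minimal PyTorch DDP training step might look like the following; the toy model, shapes, and hyperparameters are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in model; a real LLM would be wrapped the same way.
    model = torch.nn.Linear(4096, 4096).cuda()
    # DDP registers hooks that all-reduce gradient buckets during backward,
    # overlapping communication with the remaining gradient computation.
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

    x = torch.randn(8, 4096, device=f"cuda:{local_rank}")  # placeholder batch
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = ddp_model(x).square().mean()  # placeholder loss

    scaler.scale(loss).backward()  # gradients are all-reduced during backward
    scaler.step(optimizer)
    scaler.update()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this would be launched with, e.g., `torchrun --nproc_per_node=8 train.py`, one process per GPU.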
Qualifications
Required
5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure
Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA)
Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism)
Familiarity with GPU/TPU hardware and deep learning performance optimizations
Proficient in Python and C++ or CUDA for high-performance computing
Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding); a checkpointing sketch follows this list
Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization
Analytical problem-solving skills and a focus on performance improvement
Strong collaboration and communication skills across teams
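To illustrate the activation-checkpointing qualification above (a sketch only, under assumed toy sizes and a made-up block architecture), the snippet below uses `torch.utils.checkpoint` to trade recomputation for activation memory:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy feed-forward residual block, for illustration only."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

class CheckpointedStack(torch.nn.Module):
    def __init__(self, depth: int = 12, dim: int = 1024):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Discard each block's intermediate activations in forward and
            # recompute them during backward, trading FLOPs for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedStack()
y = model(torch.randn(2, 16, 1024, requires_grad=True))
y.mean().backward()  # each block's forward reruns here to rebuild activations
```

The memory saved grows with depth and activation size, at the cost of roughly one extra forward pass per backward.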
Preferred
Familiarity with graph optimization and compiler-level performance tuning
Contributions to open-source deep learning or distributed training projects
Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels); see the sketch below
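As a hedged illustration of kernel fusion (one common entry point among several; the `bias_gelu` function is a hypothetical example, not from this role), `torch.compile` can fuse adjacent elementwise ops into a single generated kernel via its default Inductor backend:

```python
import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Two elementwise ops that eager execution runs as separate kernels,
    # each with its own round trip through global memory.
    return torch.nn.functional.gelu(x + bias)

# torch.compile traces the function and, on supported backends, fuses the
# add and GELU into one generated kernel, eliminating the intermediate tensor.
fused_bias_gelu = torch.compile(bias_gelu)

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = fused_bias_gelu(x, bias)
```

Hand-written custom CUDA kernels target the same bottleneck when compiler-generated fusion is insufficient.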
Benefits
Competitive compensation
Startup equity
Health insurance
Other competitive benefits
Company
Together AI
Together AI is a cloud platform for building open-source generative AI and the infrastructure for developing AI models.
H1B Sponsorship
Together AI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of job fields receiving sponsorship; the highlighted field is similar to this job's]
Trends of Total Sponsorships
2023: 3 · 2024: 6 · 2025: 19
Funding
Current Stage: Growth Stage
Total Funding: $533.5M
Key Investors: Salesforce Ventures, Lux Capital
Funding Rounds:
2025-02-20 · Series B · $305M
2024-03-13 · Series A · $106M
2023-11-29 · Series A · $102.5M
Company data provided by Crunchbase.