Liquid AI · 1 week ago
Member of Technical Staff - Training Infrastructure Engineer
Liquid AI, spun out of MIT, is focused on building efficient AI systems at every scale. They are seeking a Training Infrastructure Engineer to design and implement high-performance training infrastructure for their GPU clusters, enabling the development of specialized and large-scale multimodal models.
Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Machine Learning
Responsibilities
Design and implement high-performance, scalable training infrastructure that efficiently utilizes our GPU clusters for both specialized and large-scale multimodal models
Build robust data loading systems that eliminate I/O bottlenecks and enable training on diverse multimodal datasets
Develop sophisticated checkpointing mechanisms that balance memory constraints with recovery needs across different model scales
Optimize communication patterns between nodes to minimize the overhead of distributed training for long-running experiments
Collaborate with ML engineers to implement new model architectures and training algorithms at scale
Create monitoring and debugging tools to ensure training stability and resource efficiency across our infrastructure
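The checkpointing responsibility above can be sketched in plain Python (a minimal, framework-agnostic illustration; the `save_checkpoint` helper and its JSON/directory layout are hypothetical, not Liquid AI's actual stack). Atomic writes plus rotation give crash-safe recovery with bounded disk use:

```python
import json
import os

def save_checkpoint(ckpt_dir, step, state, keep_last=3):
    """Atomically persist training state and rotate old checkpoints.

    Writing to a temp file and then os.replace()-ing it means a reader
    never observes a partially written checkpoint, even if the job is
    killed mid-write.
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    final = os.path.join(ckpt_dir, f"step_{step:08d}.json")
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())  # force bytes to disk before the rename
    os.replace(tmp, final)  # atomic rename on POSIX filesystems
    # Rotation: keep only the newest `keep_last` checkpoints on disk.
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.endswith(".json"))
    for old in ckpts[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, old))

def load_latest(ckpt_dir):
    """Return the most recent checkpoint dict, or None if none exist."""
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.endswith(".json"))
    if not ckpts:
        return None
    with open(os.path.join(ckpt_dir, ckpts[-1])) as f:
        return json.load(f)
```

Production systems shard model and optimizer state across ranks and write per-rank files, but the atomic-rename-plus-rotation pattern is the same trade-off between recovery guarantees and storage cost that the role describes.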
Qualifications
Required
Extensive experience building distributed training infrastructure for language and multimodal models, with hands-on expertise in frameworks such as PyTorch Distributed, DeepSpeed, or Megatron-LM
Passion for solving complex systems challenges in large-scale model training, from efficient multimodal data loading to sophisticated sharding strategies and robust checkpointing mechanisms
Deep understanding of hardware accelerators and networking topologies, with the ability to optimize communication patterns for different parallelism strategies
Skill at identifying and resolving performance bottlenecks in training pipelines, whether they occur in data loading, computation, or communication between nodes
Experience working with diverse data types (text, images, video, audio) and building data pipelines that handle heterogeneous inputs efficiently
Preferred
Implemented custom sharding techniques (tensor/pipeline/data parallelism) to scale training across distributed GPU clusters of varying sizes
Experience optimizing data pipelines for multimodal datasets with sophisticated preprocessing requirements
Built fault-tolerant checkpointing systems that handle complex model states while minimizing training interruptions
Contributed to open-source training infrastructure projects or frameworks
Designed training infrastructure that works efficiently for both parameter-efficient specialized models and massive multimodal systems
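The sharding item above ultimately rests on deterministic partition arithmetic: every rank must independently compute the same owner for every parameter slice. A minimal sketch of 1-D sharding (the `shard_range` helper is illustrative, not any framework's actual API):

```python
def shard_range(n_elems, rank, world_size):
    """Return the [start, end) slice of a flat tensor of n_elems
    elements owned by `rank` under 1-D sharding.

    The first (n_elems % world_size) ranks receive one extra element,
    so shard sizes differ by at most one and tile the tensor exactly.
    """
    base, rem = divmod(n_elems, world_size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return start, end

# Example: 10 elements over 4 ranks -> shard sizes 3, 3, 2, 2
shards = [shard_range(10, r, 4) for r in range(4)]
```

Because the function is a pure computation over `(n_elems, rank, world_size)`, no communication is needed to agree on ownership; the same idea generalizes to the tensor/pipeline/data-parallel layouts named in the posting.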
Company
Liquid AI
Build efficient general-purpose AI at every scale.
H1B Sponsorship
Liquid AI has a track record of offering H1B sponsorships. Note that this does not guarantee sponsorship for this specific role; the information below is provided for reference. (Data powered by the US Department of Labor.)
Trends of total sponsorships: 2025 (2)
Funding
Current stage: Growth Stage
Total funding: $293.1M
Key investors: AMD Ventures, OSS Capital L.P.
2024-12-13 · Series A · $250M
2023-12-01 · Seed · $37.5M
2023-05-05 · Seed · $5.6M
Company data provided by Crunchbase