Member of Technical Staff - Training Infrastructure Engineer jobs in United States

Liquid AI · 1 week ago

Member of Technical Staff - Training Infrastructure Engineer

Liquid AI, spun out of MIT, is focused on building efficient AI systems at every scale. They are seeking a Training Infrastructure Engineer to design and implement high-performance training infrastructure for their GPU clusters, enabling the development of specialized and large-scale multimodal models.

Artificial Intelligence (AI) · Foundational AI · Generative AI · Information Technology · Machine Learning
H1B Sponsor Likely

Responsibilities

Design and implement high-performance, scalable training infrastructure that efficiently utilizes our GPU clusters for both specialized and large-scale multimodal models
Build robust data loading systems that eliminate I/O bottlenecks and enable training on diverse multimodal datasets
Develop sophisticated checkpointing mechanisms that balance memory constraints with recovery needs across different model scales
Optimize communication patterns between nodes to minimize the overhead of distributed training for long-running experiments
Collaborate with ML engineers to implement new model architectures and training algorithms at scale
Create monitoring and debugging tools to ensure training stability and resource efficiency across our infrastructure
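The checkpointing responsibility above, balancing storage limits against recovery needs, can be illustrated with a minimal, hypothetical sketch in plain Python (stdlib only; `CheckpointRotator` and its file layout are illustrative and not part of any framework named in this posting):

```python
import json
import os
import tempfile

class CheckpointRotator:
    """Hypothetical sketch of a rotating checkpoint store: keeps the
    most recent `keep` checkpoints on disk and prunes older ones,
    trading storage for recovery granularity."""

    def __init__(self, directory, keep=3):
        self.directory = directory
        self.keep = keep
        os.makedirs(directory, exist_ok=True)

    def _path(self, step):
        return os.path.join(self.directory, f"ckpt_{step:08d}.json")

    def save(self, step, state):
        # Write atomically: dump to a temp file, then rename, so a crash
        # mid-write never leaves a truncated "latest" checkpoint behind.
        fd, tmp = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"step": step, "state": state}, f)
        os.replace(tmp, self._path(step))
        self._prune()

    def _prune(self):
        # Drop everything except the newest `keep` checkpoints.
        for step in self.steps()[:-self.keep]:
            os.remove(self._path(step))

    def steps(self):
        return sorted(
            int(name[5:13])
            for name in os.listdir(self.directory)
            if name.startswith("ckpt_") and name.endswith(".json")
        )

    def load_latest(self):
        steps = self.steps()
        if not steps:
            return None
        with open(self._path(steps[-1])) as f:
            return json.load(f)
```

Real training-infrastructure checkpointers additionally shard model and optimizer state across ranks; the atomic rename-on-write pattern shown here is the part that keeps a recovery point valid even if a node dies mid-save.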

Qualifications

Distributed training infrastructure · PyTorch Distributed · DeepSpeed · Megatron-LM · Hardware accelerators · Networking topologies · Performance bottlenecks · Data pipelines · Open-source contributions · Checkpointing systems · Multimodal datasets · Collaboration with ML engineers

Required

Extensive experience building distributed training infrastructure for language and multimodal models, with hands-on expertise in frameworks like PyTorch Distributed, DeepSpeed, or Megatron-LM
Passion for solving complex systems challenges in large-scale model training, from efficient multimodal data loading to sophisticated sharding strategies to robust checkpointing mechanisms
Deep understanding of hardware accelerators and networking topologies, with the ability to optimize communication patterns for different parallelism strategies
Skill at identifying and resolving performance bottlenecks in training pipelines, whether they occur in data loading, computation, or communication between nodes
Experience working with diverse data types (text, images, video, audio) and the ability to build data pipelines that handle heterogeneous inputs efficiently

Preferred

Implemented custom sharding techniques (tensor/pipeline/data parallelism) to scale training across distributed GPU clusters of varying sizes
Experience optimizing data pipelines for multimodal datasets with sophisticated preprocessing requirements
Built fault-tolerant checkpointing systems that can handle complex model states while minimizing training interruptions
Contributed to open-source training infrastructure projects or frameworks
Designed training infrastructure that works efficiently for both parameter-efficient specialized models and massive multimodal systems
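The data-parallel sharding called out above can be sketched in plain Python. The `shard_indices` helper below is hypothetical; it mirrors the index-partitioning idea behind samplers such as PyTorch's DistributedSampler (disjoint, equally sized slices per rank, with wrap-around padding so every rank takes the same number of steps), without depending on PyTorch itself:

```python
def shard_indices(num_samples, world_size, rank):
    """Hypothetical sketch of data-parallel index sharding: each rank
    gets a disjoint, equally sized slice of the dataset, padded by
    wrapping around so all ranks run the same number of steps."""
    # Ceil-divide so the padded total splits evenly across ranks.
    per_rank = -(-num_samples // world_size)
    total = per_rank * world_size
    # Wrap-around padding: the last few padded indices reuse the
    # start of the dataset.
    indices = [i % num_samples for i in range(total)]
    # Strided assignment: rank r takes indices r, r + world_size, ...
    return indices[rank:total:world_size]
```

Equal shard sizes matter in practice because collective operations (e.g. gradient all-reduce) block until every rank participates; a rank with fewer batches would deadlock the others.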

Company

Liquid AI

Build efficient general-purpose AI at every scale.

H1B Sponsorship

Liquid AI has a track record of offering H1B sponsorships. Note that this does not guarantee sponsorship for this specific role. The information below is provided for reference (data powered by the US Department of Labor).
[Charts: distribution of job fields receiving sponsorship, highlighting fields similar to this job; total sponsorships per year — 2025: 2]

Funding

Current Stage
Growth Stage
Total Funding
$293.1M
Key Investors
AMD Ventures · OSS Capital L.P.
2024-12-13 · Series A · $250M
2023-12-01 · Seed · $37.5M
2023-05-05 · Seed · $5.6M

Leadership Team

Ramin Hasani
Co-founder and CEO
Mathias Lechner
Co-founder and CTO
Company data provided by Crunchbase