Research Engineer, Infrastructure, Numerics jobs in United States

Thinking Machines Lab · 1 month ago

Research Engineer, Infrastructure, Numerics

Thinking Machines Lab is dedicated to advancing collaborative general intelligence. They are seeking an infrastructure research engineer to design and optimize systems for large-scale model training, focusing on numerics and ensuring stability and efficiency across distributed environments.

Artificial Intelligence (AI), Foundational AI, Generative AI, Information Technology, Product Research, Software
H1B Sponsored

Responsibilities

Design and optimize distributed training infrastructure for large-scale LLMs, focusing on performance, stability, and reproducibility across multi-GPU and multi-node setups
Implement and evaluate low-precision numerics (for example, BF16, MXFP8, NVFP4) to improve efficiency without sacrificing model quality
Develop kernels and communication primitives that use hardware-level support for mixed and low-precision arithmetic
Collaborate with research teams to co-design model architectures and training recipes that align with emerging numeric formats and stability constraints
Prototype and benchmark scaling strategies such as data, tensor, and pipeline parallelism that integrate precision-adaptive computation and quantized communication
Contribute to the design of our internal orchestration and monitoring systems to ensure that thousands of distributed experiments can run efficiently and reproducibly
Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure

Qualifications

Distributed training infrastructure, Low-precision numerics, Deep learning frameworks, Floating-point numerics, Collaborative environment, Engineering skills, Bias for action, Open-source contributions, Large-scale AI models

Required

Bachelor's degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar
Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures
Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts
A bias for action: you take the initiative to work across stacks and teams wherever you spot an opportunity to make sure something ships
Strong engineering skills, including the ability to contribute performant, maintainable code and to debug complex codebases in areas such as floating-point numerics, low-precision arithmetic, and distributed systems

Preferred

Familiarity with distributed frameworks such as PyTorch/XLA, DeepSpeed, Megatron-LM
Experience implementing FP8, INT8, or block floating-point (MX) formats and understanding their numerical trade-offs
Prior contributions to open-source deep learning infrastructure such as PyTorch, DeepSpeed, or XLA
Publications, patents, or projects related to numerical optimization, communication-efficient training, or systems for large models
Experience training and supporting large-scale AI models
Track record of improving research productivity through infrastructure design or process improvements

Benefits

Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support as needed

Company

Thinking Machines Lab

Thinking Machines Lab is an AI research and product company that aims to increase understanding and customization of AI systems.

H1B Sponsorship

Thinking Machines Lab has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
[Chart: Trends of Total Sponsorships] 2025: 9

Funding

Current Stage: Early Stage
Total Funding: $2.01B
Key Investors: Andreessen Horowitz; Ministry of Economy, Culture and Innovation
2025-06-20 · Seed · $2B
2025-05-05 · Grant · $9.98M

Leadership Team

Mira Murati, Co-Founder and Chief Executive Officer
Soumith Chintala, CTO
Company data provided by Crunchbase