Thinking Machines Lab · 1 month ago
Research, Pre-Training Data
Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. The role of pre-training researchers is to blend research with large-scale data engineering to assemble pre-training datasets that support the next generation of AI models.
Artificial Intelligence (AI), Foundational AI, Generative AI, Information Technology, Product Research, Software
Responsibilities
Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data
Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources
Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly
Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use
Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior
Publish and present research that moves the entire community forward. Share code, datasets, and insights that accelerate progress across industry and academia
Qualifications
Required
Proficiency in Python and familiarity with at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX). Comfortable with debugging distributed training and writing code that scales
Bachelor's degree or equivalent experience in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding
Clarity in communication and an ability to explain complex technical concepts in writing
Preferred
A strong grasp of probability, statistics, and ML fundamentals. You can look at experimental data and distinguish between real effects, noise, and bugs
Experience with curation, preprocessing, and analysis of large-scale text, code, or multimodal datasets
Prior experience in data engineering, dataset construction, or large-scale web data processing for machine learning models
Experience evaluating or improving training data quality and knowledge of data ethics, safety, and licensing frameworks relevant to AI dataset creation
Contributions to open datasets, research publications, or data tooling
PhD in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding, or equivalent industry research experience
Benefits
Generous health, dental, and vision benefits
Unlimited PTO
Paid parental leave
Relocation support as needed
Company
Thinking Machines Lab
Thinking Machines Lab is an AI research and product company that aims to increase understanding and customization of AI systems.
H1B Sponsorship
Thinking Machines Lab has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships: 2025 (9)
Funding
Current Stage: Early Stage
Total Funding: $2.01B
Key Investors: Andreessen Horowitz; Ministry of Economy, Culture and Innovation
2025-06-20 · Seed · $2B
2025-05-05 · Grant · $9.98M
Recent News
2026-01-20
Company data provided by Crunchbase