Reddit, Inc. · 9 hours ago
Staff Research Engineer, Pre-training Data
Reddit is a community-driven platform seeking a Staff Research Engineer for Pre-training Data to enhance its AI Engineering team. This role involves defining the technical strategy for data curriculum pipelines to train foundational Large Language Models, ensuring they understand the unique culture of Reddit communities.
Content · News · Social Media · Social Network
Responsibilities
Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale
Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities
Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding
Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality
Design the "Safety-First" ingestion layer: Build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts
Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure
Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments
Qualifications
Required
8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training
Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders)
Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video
Strong mathematical foundation in probability, statistics, and importance sampling theory
Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance
Experience working with Graph data structures or serializing conversation trees is highly valued
Preferred
Experience with JAX or PyTorch internals related to distributed data loading
Experience with Multimodal datasets (image/video + text) and vision-language preprocessing
Proficiency in Rust or C++ for performance-critical data path optimization
Published research or significant practical experience in active learning or automated data selection
Benefits
Comprehensive Healthcare Benefits and Income Replacement Programs
401k with Employer Match
Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
Family Planning Support
Gender-Affirming Care
Mental Health & Coaching Benefits
Flexible Vacation & Paid Volunteer Time Off
Generous Paid Parental Leave
Company
Reddit, Inc.
Reddit is the heart of the internet, where millions of people get together to talk about any topic imaginable.
H1B Sponsorship
Reddit, Inc. has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (99)
2024 (63)
2023 (76)
2022 (70)
2021 (68)
2020 (39)
Funding
Current Stage: Public Company
Total Funding: $1.33B
Key Investors: Fidelity, Vy Capital, Tencent
2024-03-21: IPO
2021-08-12: Series F · $410M
2021-02-08: Series E · $367.95M
Company data provided by Crunchbase