Staff Research Engineer, Pre-training Data jobs in United States

Reddit, Inc. · 8 hours ago

Staff Research Engineer, Pre-training Data

Reddit is a community-driven platform seeking a Staff Research Engineer for Pre-training Data to enhance its AI Engineering team. This role involves defining the technical strategy for data curriculum pipelines to train foundational Large Language Models, ensuring they understand the unique culture of Reddit communities.

Content · News · Social Media · Social Network
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale
Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities
Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding
Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality
Design the "Safety-First" ingestion layer: Build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts
Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure
Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments

Qualifications

Machine learning infrastructure · Python · Distributed data processing · Unstructured data handling · Mathematical foundation · Pre-training dynamics · Graph data structures · JAX · PyTorch · Multimodal datasets · Rust · C++ · Active learning experience

Required

8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training
Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders)
Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video
Strong mathematical foundation in probability, statistics, and importance sampling theory
Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance
Experience working with Graph data structures or serializing conversation trees is highly valued

Preferred

Experience with JAX or PyTorch internals related to distributed data loading
Experience with Multimodal datasets (image/video + text) and vision-language preprocessing
Proficiency in Rust or C++ for performance-critical data path optimization
Published research or significant practical experience in active learning or automated data selection

Benefits

Comprehensive Healthcare Benefits and Income Replacement Programs
401k with Employer Match
Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
Family Planning Support
Gender-Affirming Care
Mental Health & Coaching Benefits
Flexible Vacation & Paid Volunteer Time Off
Generous Paid Parental Leave

Company

Reddit, Inc.

Reddit is the heart of the internet, where millions of people get together to talk about any topic imaginable.

H1B Sponsorship

Reddit, Inc. has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data Powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (99)
2024 (63)
2023 (76)
2022 (70)
2021 (68)
2020 (39)

Funding

Current Stage: Public Company
Total Funding: $1.33B
Key Investors: Fidelity, Vy Capital, Tencent
2024-03-21: IPO
2021-08-12: Series F · $410M
2021-02-08: Series E · $367.95M

Leadership Team

Steve Huffman, CEO
Chris Slowe, CTO
Company data provided by Crunchbase