Reddit, Inc. · 9 hours ago

Staff Research Engineer, Pre-training Data

United States

Full-time

Remote

Lead/Staff

$230K/yr - $322K/yr

8+ years exp

Reddit is a community-driven platform seeking a Staff Research Engineer for Pre-training Data to enhance its AI Engineering team. This role involves defining the technical strategy for data curriculum pipelines to train foundational Large Language Models, ensuring they understand the unique culture of Reddit communities.

Content, News, Social Media, Social Network

Comp. & Benefits

H1B Sponsor Likely

Responsibilities

Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale

Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities

Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding

Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality

Design the "Safety-First" ingestion layer: Build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts

Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure

Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments

Qualifications

Machine learning infrastructure, Python, Distributed data processing, Unstructured data handling, Mathematical foundation, Pre-training dynamics, Graph data structures, JAX, PyTorch, Multimodal datasets, Rust, C++, Active learning experience

Required

8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training

Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders)

Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video

Strong mathematical foundation in probability, statistics, and importance sampling theory

Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance

Experience working with Graph data structures or serializing conversation trees is highly valued

Preferred

Experience with JAX or PyTorch internals related to distributed data loading

Experience with Multimodal datasets (image/video + text) and vision-language preprocessing

Proficiency in Rust or C++ for performance-critical data path optimization

Published research or significant practical experience in active learning or automated data selection

Benefits

Comprehensive Healthcare Benefits and Income Replacement Programs

401k with Employer Match

Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support

Family Planning Support

Gender-Affirming Care

Mental Health & Coaching Benefits

Flexible Vacation & Paid Volunteer Time Off

Generous Paid Parental Leave

Company

Reddit, Inc.

Glassdoor: 3.8

Reddit is the heart of the internet, where millions of people get together to talk about any topic imaginable.

Founded in 2005

San Francisco, California, USA

1001-5000 employees

https://www.redditinc.com

H1B Sponsorship

Reddit, Inc. has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (99)

2024 (63)

2023 (76)

2022 (70)

2021 (68)

2020 (39)

Funding

Current Stage

Public Company

Total Funding

$1.33B

Key Investors

Fidelity, Vy Capital, Tencent

2024-03-21: IPO

2021-08-12: Series F · $410M

2021-02-08: Series E · $367.95M

Leadership Team

Steve Huffman

CEO

Chris Slowe

CTO

Recent News

BusinessCloud

Could a UK social media ban for under-16s be on the horizon?

2026-01-22

initialized.com

Reddit: Initialized Capital

2026-01-21

TradingView

Reddit CFO Andrew Vollero Sells Shares Worth Over $1 Million

2026-01-16

Company data provided by Crunchbase