Senior Machine Learning Engineer, ML Training Platform jobs in United States

Reddit, Inc. · 3 hours ago

Senior Machine Learning Engineer, ML Training Platform

Reddit, a community of communities built on shared interests and trust, is seeking a Senior Machine Learning Engineer for its Machine Learning Platform team. The role involves architecting and maintaining the foundational ML infrastructure that supports applications such as recommendations and content discovery.

Content · News · Social Media · Social Network
Comp. & Benefits
H1B Sponsor Likely

Responsibilities

Lead the building, testing, and maintenance of ML training infrastructure at Reddit
Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
Evolve the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows
Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance
GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully
Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the 'Idea-to-Prototype' loop, and standardize software environments (Docker images, Python dependency management)

Qualifications

Machine Learning Infrastructure · Kubernetes Expertise · Python · GPU Experience · Cloud Provider Experience · Distributed Training Frameworks · Developer Experience · Organizational Skills · Communication Skills

Required

5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems
Deep Kubernetes Expertise: You know K8s beyond just 'deploying pods.' You understand CRDs, Controllers and the Operator pattern
Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms
Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling)
GPU Experience: Hands-on practice with CUDA environments, GPU virtualization/containerization, and doing it all within Kubernetes
Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, SageMaker, etc.) and building custom ML components in AWS and/or GCP
Experience working with distributed training frameworks, including Ray and Kubernetes
Comfortable with distributed systems, big data (petabyte scale), and data-intensive systems
Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle
Strong organizational & communication skills

Benefits

Comprehensive Healthcare Benefits and Income Replacement Programs
401k Match
Family Planning Support
Gender-Affirming Care
Mental Health & Coaching Benefits
Flexible Vacation & Reddit Global Days off
Generous paid Parental Leave
Paid Volunteer time off

Company

Reddit, Inc.

Reddit is the heart of the internet, where millions of people get together to talk about any topic imaginable.

H1B Sponsorship

Reddit, Inc. has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (99)
2024 (63)
2023 (76)
2022 (70)
2021 (68)
2020 (39)

Funding

Current Stage
Public Company
Total Funding
$1.33B
Key Investors
Fidelity · Vy Capital · Tencent
2024-03-21 · IPO
2021-08-12 · Series F · $410M
2021-02-08 · Series E · $367.95M

Leadership Team

Steve Huffman
CEO
Chris Slowe
CTO
Company data provided by Crunchbase