Apple · 8 hours ago
Senior Research Engineer, Training Data Infrastructure in Foundation Models
Apple is dedicated to solving the high-quality training data problem at scale for advanced Foundation Models. They are seeking a Senior Research Engineer to design systems that focus on the statistical distribution and quality of data, working closely with Research Scientists to transform theoretical observations into scalable engineering solutions.
Apps, Artificial Intelligence (AI), Broadcasting, Digital Entertainment, Foundational AI, Media and Entertainment, Mobile Devices, Operating Systems, TV, Wearables
Responsibilities
Architect Scalable Ingestion Systems: Design and implement high-throughput distributed systems to ingest petabytes of text and multimodal data from diverse sources, including web crawls and third-party partnerships
Repository Optimization: Manage the lifecycle of large-scale datasets across data storage and high-performance file systems. Optimize data formats for efficient random access and sequential scanning during model training
Data Governance & Privacy: Engineer robust data governance and privacy solutions for the training data, in collaboration with compliance and legal teams, to ensure adherence to stringent regulatory standards
High-Performance Processing Pipelines: Build and maintain distributed data processing workflows using advanced frameworks on cloud infrastructure (e.g., GCP, AWS)
Algorithmic Data Curation: Implement sophisticated data filtering and selection logic to remove low-quality content. Develop semantic deduplication at scale to prevent model memorization and improve training efficiency
Benchmark Decontamination: Design automated systems to detect and remove benchmark leakage, ensuring that evaluation datasets remain strictly isolated from training corpora
Infrastructure for Scaling Laws: Collaborate with researchers to enable data ablations and scaling experiments. Build tools to support systematic data mixture optimization and empirical data studies
Qualifications
Required
Education: Bachelor's degree in Computer Science, Electrical Engineering, or Mathematics
Technical Expertise: 4+ years of software engineering experience with a specific focus on Data Infrastructure, Distributed Systems, or AI/ML Engineering
Language Proficiency: Expert fluency in Python, and strong competence in system languages such as C++
Cloud Architecture: Extensive experience architecting solutions on major public cloud platforms (e.g., GCP) to build scalable data systems (e.g., with Apache Beam and GCS)
Performance Engineering: Deep experience profiling and optimizing high-throughput data systems. Demonstrated ability to debug distributed bottlenecks (e.g., stragglers, I/O saturation), optimize data formats and provide efficient data storage solutions
Preferred
Research Collaboration: Experience working within or closely with ML research organizations (e.g., as a Research Engineer), with an ability to translate research results into engineering implementations
Domain Knowledge: Familiarity with the lifecycle of modern LLM training, end-to-end workflows, and underlying system architecture
Complex Data Types: Experience in processing complex data modalities beyond plain text, such as source code repositories, images, video, and audio
Benefits
Comprehensive medical and dental coverage
Retirement benefits
A range of discounted products and free services
Reimbursement for certain educational expenses — including tuition
Discretionary bonuses or commission payments
Relocation
Company
Apple
Apple is a technology company that designs, manufactures, and markets consumer electronics, personal computers, and software.
H1B Sponsorship
Apple has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (6998)
2024 (3766)
2023 (3939)
2022 (4822)
2021 (4060)
2020 (3656)
Funding
Current Stage: Late Stage
Total Funding: $5.67B
Key Investors: Berkshire Hathaway, Microsoft, Sequoia Capital
2026-01-10: Pre Seed · $1M
2025-05-05: Post IPO Debt · $4.5B
2025-01-16: Post IPO Debt · $0.31M
Leadership Team
Tim Cook
CEO
Craig Federighi
SVP, Software Engineering
Company data provided by crunchbase