Zyphra · 3 months ago

Machine Learning Data Engineer - Systems & Retrieval

San Francisco, CA

Full-time

Onsite

Mid Level

Zyphra is an artificial intelligence company based in San Francisco, California. The Machine Learning Data Engineer - Systems & Retrieval will build and optimize the data infrastructure for machine learning systems, focusing on designing high-performance data pipelines and architecting retrieval systems for LLMs.

Artificial Intelligence (AI), Cloud Computing, Software, Machine Learning

H1B Sponsored

Responsibilities

Designing and implementing distributed data ingestion and transformation pipelines

Building retrieval and indexing systems that support RAG and other LLM-based methods

Mining and organizing large unstructured datasets, both in research and production environments

Collaborating with ML engineers, systems engineers, and DevOps to scale pipelines and observability

Ensuring compliance and access control in data handling, with security and auditability in mind

Qualifications

Python, Data pipelines, Distributed data systems, Indexing techniques, Database systems, Security compliance, ETL systems, Vector databases, Data validation, Debugging practices, Communication skills, Collaboration experience

Required

Strong software engineering background with fluency in Python

Experience designing, building, and maintaining data pipelines in production environments

Deep understanding of data structures, storage formats, and distributed data systems

Familiarity with indexing and retrieval techniques for large-scale document corpora

Understanding of database systems (SQL and NoSQL), their internals, and performance characteristics

Strong attention to security, access controls, and compliance best practices (e.g., GDPR, SOC2)

Excellent debugging, observability, and logging practices to support reliability at scale

Strong communication skills and experience collaborating across ML, infra, and product teams

Preferred

Experience building or maintaining LLM-integrated retrieval systems (e.g., RAG pipelines)

Academic or industry background in data mining, search, recommendation systems, or IR literature

Experience with large-scale ETL systems and tools like Apache Beam, Spark, or similar

Familiarity with vector databases (e.g., FAISS, Weaviate, Pinecone) and embedding-based retrieval

Understanding of data validation and quality assurance in machine learning workflows

Experience working on cross-functional infra and MLOps teams

Knowledge of how data infrastructure supports training pipelines, inference serving, and feedback loops

Comfort working across raw, unstructured data, structured databases, and model-ready formats

Benefits

Comprehensive medical, dental, vision, and FSA plans

Competitive compensation and 401(k)

Relocation and immigration support on a case-by-case basis

On-site meals prepared by a dedicated culinary team; Thursday Happy Hours

Company

Zyphra

Zyphra is a superintelligence research and product company based in San Francisco, California.

Founded in 2021

Palo Alto, California, USA

51-200 employees

https://zyphra.com

H1B Sponsorship

Zyphra has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data provided by the US Department of Labor)

Total sponsorships by year: 2025 (1)

Funding

Current Stage

Growth Stage

Total Funding

$100M

2025-06-09 · Series A · $100M

2023-06-09 · Seed

2021-11-18 · Pre Seed

Recent News

Hot Hardware

AMD Zyphra GPU Cluster Gives Birth To ZAYA 1 MoE AI Model, Smokes Llama3.1

2025-11-30

Benzinga.com

AMD Stock Bounces Back After Sharp Sell-Off: What's Going On?

2025-11-27

Morningstar.com

Zyphra Demonstrates First Large Scale Training on Integrated AMD Compute and Networking Powered by IBM Cloud

2025-11-24

Company data provided by Crunchbase