Founding Engineer, Data Platform jobs in United States

Elicit · 8 hours ago

Founding Engineer, Data Platform

Elicit is an AI research assistant that helps researchers and decision makers tackle complex questions using language models. The Founding Engineer for the Data Platform will be responsible for architecting and implementing scalable data ingestion systems to enhance the platform's capabilities with diverse data sources.
Artificial Intelligence (AI) · Database · Data Center Automation · Information Technology
Growth Opportunities

Responsibilities

Building and optimizing our academic research paper pipeline
You'll architect and implement robust, scalable systems to handle data ingestion while maintaining high performance and quality
You'll work on efficiently deduplicating hundreds of millions of research papers and calculating embeddings for them
Your goal will be to make Elicit the most complete and up-to-date database of scholarly sources
Expanding the datasets Elicit works over
Our users want Elicit to work over court documents, SEC filings, … your job will be to figure out how to ingest and index a rapidly increasing ontology of documents
We also want to support less structured documents, spreadsheets, presentations, all the way up to rich media like audio and video
Larger customers often want us to integrate private data into Elicit for their organisation to use. We'll look to you to define and build a secure, reliable, fast, and auditable approach to these data connectors
Data for our ML systems
You'll figure out the best way to preprocess all of the data mentioned above to make it useful to models
We often need datasets for our model fine-tuning. You'll work with our ML engineers and evaluation experts to find, gather, version, and apply these datasets in training runs

Qualifications

Python · Data engineering · Spark · SQL · Data pipeline architecture · Data quality management · Full-text extraction · Machine learning concepts · Distributed computing frameworks · Columnar data storage · Deduplication processes · Creative problem-solving

Required

5+ years of experience as a data engineer: owning make-or-break decisions about how to ingest, manage, and use data
Strong proficiency in Python (5+ years experience)
You have created and owned a data platform at rapidly growing startups: gathering needs from colleagues, planning an architecture, deploying the infrastructure, and implementing the tooling
Experience architecting and optimizing large data pipelines, ideally with Spark, and ideally pipelines that directly support user-facing features (rather than internal BI, for example)
Strong SQL skills, including understanding of aggregation functions, window functions, UDFs, self-joins, partitioning, and clustering approaches
Experience with columnar data storage formats like Parquet
Strong opinions, weakly held, about approaches to data quality management
Creative and user-centric problem-solving
You should be excited to play a key role in shipping new features to users—not just building out a data platform!

Preferred

Experience in developing deduplication processes for large datasets
Hands-on experience with full-text extraction and processing from various document formats (PDF, HTML, XML, etc.)
Familiarity with machine learning concepts and their application in search technologies
Experience with distributed computing frameworks beyond Spark (e.g., Dask, Ray)
Experience in science and academia: familiarity with academic publications, and the ability to accurately model the needs of our users
Hands-on experience with industry-standard tools like Airflow, dbt, or Hadoop
Hands-on experience with standard paradigms like data lake, data warehouse, or lakehouse

Benefits

Flexible work environment: work from our office in Oakland or remotely with time zone overlap (between GMT and GMT-8), as long as you can travel for in-person offsites
Fully covered health, dental, vision, and life insurance for you, generous coverage for the rest of your family
Flexible vacation policy, with a minimum recommendation of 20 days/year + company holidays
401K with a 6% employer match
A new Mac + $1,000 budget to set up your workstation or home office in your first year, then $500 every year thereafter
$1,000 quarterly AI Experimentation & Learning budget, so you can freely experiment with new AI tools to incorporate into your workflow, take courses, purchase educational resources, or attend AI-focused conferences and events
A team administrative assistant who can help you with personal and work tasks

Company

Elicit

Elicit uses language models to help users automate research workflows.

Funding

Current Stage
Early Stage
Total Funding
$31M
Key Investors
Footwork, Spark Capital, Fifty Years
2025-02-26 · Series A · $22M
2023-09-25 · Seed · $9M

Leadership Team

Andreas Stuhlmüller
Cofounder & CEO
Jungwon Byun
Cofounder & COO
Company data provided by Crunchbase