Data Engineer, Scientific Data Ingestion jobs in United States

Mithrl · 1 month ago

Data Engineer, Scientific Data Ingestion

Mithrl is a fast-growing tech-bio startup focused on accelerating breakthroughs in life sciences through their AI Co-Scientist platform. They are seeking a Data Engineer to build and manage an AI-powered data ingestion and normalization pipeline, ensuring data quality and integration for downstream analytics.

Artificial Intelligence (AI), Data Center Automation, Life Science, Medical, Software

Responsibilities

Build and own an AI-powered ingestion & normalization pipeline to import data from a wide variety of sources — unprocessed Excel/CSV uploads, lab and instrument exports, as well as processed data from internal pipelines
Develop robust schema mapping, coercion, and conversion logic (think: units normalization, metadata standardization, variable-name harmonization, vendor-instrument quirks, plate-reader formats, reference-genome or annotation updates, batch-effect correction, etc.)
Use LLM-driven and classical data-engineering tools to structure “semi-structured” or messy tabular data — extracting metadata, inferring column roles/types, cleaning free-text headers, fixing inconsistencies, and preparing final clean datasets
Ensure all transformations that should only happen once (normalization, coercion, batch-correction) execute during ingestion — so downstream analytics / the AI “Co-Scientist” always works with clean, canonical data
Build validation, verification, and quality-control layers to catch ambiguous, inconsistent, or corrupt data before it enters the platform
Collaborate with product teams, data science / bioinformatics colleagues, and infrastructure engineers to define and enforce data standards, and ensure pipeline outputs integrate cleanly into downstream analysis and storage systems

Qualifications

Data engineering, Python, ETL/ELT pipelines, Data processing tools, Cloud infrastructure, Scientific data, Workflow orchestration tools, Computational biology background, Communication skills, Collaboration across teams

Required

5+ years of experience in data engineering / data wrangling with real-world tabular or semi-structured data
Strong fluency in Python and data processing tools (Pandas, Polars, PyArrow, or similar)
Extensive experience with messy Excel / CSV / spreadsheet-style data (inconsistent headers, multiple sheets, mixed formats, free-text fields) and with normalizing it into clean structures
Comfort designing and maintaining robust ETL/ELT pipelines, ideally for scientific or lab-derived data
Ability to combine classical data engineering with LLM-powered data normalization / metadata extraction / cleaning
Strong desire and ability to own the ingestion & normalization layer end-to-end — from raw upload → final clean dataset — with an eye for maintainability, reproducibility, and scalability
Good communication skills; able to collaborate across teams (product, bioinformatics, infra) and translate real-world messy data problems into robust engineering solutions

Preferred

Familiarity with scientific data types and “modalities” (e.g. plate-readers, genomics metadata, time-series, batch-info, instrumentation outputs)
Experience with workflow orchestration tools (e.g. Nextflow, Prefect, Airflow, Dagster), or building pipeline abstractions
Experience with cloud infrastructure and data storage (AWS S3, data lakes/warehouses, database schemas) to support multi-tenant ingestion
Past exposure to LLM-based data transformation or cleansing agents — building or integrating tools that clean or structure messy data automatically
Any background in computational biology / lab-data / bioinformatics is a bonus — though not required

Benefits

Comprehensive PPO health coverage through Anthem (medical, dental, and vision)
401(k) with top-tier plans

Company

Mithrl

Mithrl is a software development company that builds custom, on-demand workflows for NGS data.

Funding

Current Stage: Early Stage
Total Funding: $4M
Key Investors: Bonfire Ventures
2024-11-14 · Seed · $4M

Leadership Team

Shara Balakrishnan, Ph.D.
Chief Technology Officer
Company data provided by Crunchbase