ML Data Engineer – Healthcare Data Curation & Cleaning (1 Year Fixed Term), United States

Stanford University School of Medicine · 5 hours ago

Stanford University is seeking an experienced ML Data Engineer to drive the programmatic curation, cleaning, and generation of healthcare data. This role focuses on developing and maintaining automated, ML-accelerated pipelines that produce high-quality data ready for machine learning applications in a complex healthcare environment.

Education, Higher Education, Medical
H1B Sponsor Likely

Responsibilities

Design Big Data systems that are scalable, optimized and fault-tolerant
Work closely with scientific staff, IT professionals, and project managers to understand their data requirements for existing and future projects involving Big Data
Develop, test, implement, and maintain database management applications. Optimize and tune the system, perform software review and maintenance to ensure that data design elements are reusable, repeatable and robust
Contribute to the development of guidelines, standards, and processes to ensure data quality, integrity and security of systems and data appropriate to risk
Participate in and/or contribute to setting strategy and standards through data architecture and implementation, leveraging Big Data, analytics tools and technologies
Work with IT and data owners to understand the types of data collected in various databases and data warehouses
Research and suggest new toolsets/methods to improve data ingestion, storage, and data access
Design, implement, and maintain robust pipelines for the programmatic cleaning, transformation, and curation of healthcare data
Develop automated processes to curate and validate data, ensuring accuracy and compliance with healthcare standards (e.g. OMOP CDM, FHIR)
Leverage core machine learning techniques to generate datasets, clean existing health records, join heterogeneous data sources, and enhance data quality for model training
Implement innovative solutions to detect and correct data inconsistencies and anomalies in large-scale healthcare datasets
Work extensively with patient-level health data, ensuring that data handling practices adhere to industry regulations and ethical standards
Utilize the OMOP Common Data Model (OMOP CDM) to standardize and harmonize disparate healthcare data sources, enhancing interoperability and scalability
Collaborate closely with data scientists, clinical informaticians, and engineers to align data engineering practices with analytical and clinical requirements
Continuously monitor, troubleshoot, and optimize data workflows to support dynamic research and operational needs

Qualifications

Python, Data Pipeline Engineering, Healthcare Data Expertise, Machine Learning Frameworks, Big Data Architecture, Automated Data Pipelines, Linux Environment, Cloud Platforms, Collaboration, Communication Skills

Required

3+ years of experience in software development and data engineering with a strong focus on data cleaning, transformation, and creation
Proficiency in Python and experience with data processing libraries (e.g., Pandas, Polars, NumPy)
Hands-on experience in building and maintaining automated data pipelines for large-scale data processing
Familiarity with machine learning frameworks (e.g., PyTorch, JAX, scikit-learn) as applied to data quality and augmentation tasks
Expertise in working with healthcare data, including familiarity with the OMOP Common Data Model (OMOP CDM)
Strong experience in a Linux environment and comfort with UNIX command-line tools
Proven ability to work collaboratively in multidisciplinary teams and communicate technical concepts effectively
Bachelor's degree in a scientific or analytic field and five years of relevant experience, or a combination of education and relevant experience
Knowledge of key data structures, algorithms, and techniques pertinent to systems that support high volume, velocity, or variety datasets (including data mining, machine learning, NLP, data retrieval)
Experience with relational, NoSQL, or NewSQL database systems and data modeling, structured and unstructured
Experience in parallel and distributed data processing techniques and platforms (MPI, Map/Reduce, Batch)
Experience with scripting languages and debugging them, and with high-performance/systems languages and techniques
Knowledge of benchmark software development and programmable fields/systems; ability to analyze systems and data pipelines and propose solutions that leverage emerging technologies
Ability to use and integrate security controls for web applications, mobile platforms, and backend systems
Experience deploying reliable data systems and data quality management
Ability to research, evaluate, architect, and deploy new tools, frameworks, and patterns to build scalable Big Data platforms
Ability to document use cases, solutions and recommendations
Demonstrated excellence in written and verbal communication skills

Preferred

Experience with cloud platforms (e.g., GCP, AWS, or Azure) and distributed computing frameworks
Proficiency with version control systems (e.g., Git) and containerization tools (e.g., Docker)
Familiarity with healthcare data standards and regulatory requirements

Benefits

Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position.
The Cardinal at Work website (https://cardinalatwork.stanford.edu/benefits-rewards) provides detailed information on Stanford’s extensive range of benefits and rewards offered to employees.

Company

Stanford University School of Medicine

Stanford University School of Medicine is the medical school of Stanford University. It is a sub-organization of Stanford University.

H1B Sponsorship

Stanford University School of Medicine has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)
Trends of Total Sponsorships
2025 (551)
2024 (499)
2023 (472)
2022 (390)
2021 (336)
2020 (260)

Funding

Current Stage
Late Stage
Total Funding
$10M
Key Investors
American Medical Association
2023-06-21 · Grant
2017-07-19 · Grant · $10M

Leadership Team

Darius M. Moshfeghi
Chief of Retina Division: Vitreoretinal Surgery and Medical Diseases
Quinton R. Markett
Research Professional, Utz Lab, Department of Medicine, Division of Immunology & Rheumatology
Company data provided by Crunchbase