Inside Higher Ed · 1 week ago

ML Data Engineer – Healthcare Data Curation & Cleaning (1 Year Fixed Term)

Stanford, CA

Full-time

Hybrid

Mid, Senior Level

$158K/yr - $177K/yr

5+ years exp

Stanford University is seeking an experienced ML Data Engineer to drive the programmatic curation, cleaning, and generation of healthcare data. The role focuses on developing and maintaining automated, ML-accelerated pipelines that ensure high-quality data ready for machine learning applications in a complex healthcare environment.

Responsibilities

Design Big Data systems that are scalable, optimized and fault-tolerant

Work closely with scientific staff, IT professionals, and project managers to understand their data requirements for existing and future projects involving Big Data

Develop, test, implement, and maintain database management applications. Optimize and tune the system, and perform software review and maintenance to ensure that data design elements are reusable, repeatable, and robust

Contribute to the development of guidelines, standards, and processes to ensure data quality, integrity and security of systems and data appropriate to risk

Participate in and/or contribute to setting strategy and standards through data architecture and implementation, leveraging Big Data, analytics tools and technologies

Work with IT and data owners to understand the types of data collected in various databases and data warehouses

Research and suggest new toolsets/methods to improve data ingestion, storage, and data access

Design, implement, and maintain robust pipelines for the programmatic cleaning, transformation, and curation of healthcare data

Develop automated processes to curate and validate data, ensuring accuracy and compliance with healthcare standards (e.g. OMOP CDM, FHIR)

Leverage core machine learning techniques to generate datasets, clean existing health records, join heterogeneous data sources, and enhance data quality for model training

Implement innovative solutions to detect and correct data inconsistencies and anomalies in large-scale healthcare datasets

Work extensively with patient-level health data, ensuring that data handling practices adhere to industry regulations and ethical standards

Utilize the OMOP Common Data Model (OMOP CDM) to standardize and harmonize disparate healthcare data sources, enhancing interoperability and scalability

Collaborate closely with data scientists, clinical informaticians, and engineers to align data engineering practices with analytical and clinical requirements

Continuously monitor, troubleshoot, and optimize data workflows to support dynamic research and operational needs

Qualifications

Python · Data Pipeline Engineering · Healthcare Data Expertise · Machine Learning Frameworks · Big Data Systems Design · Automated Data Pipelines · Linux Environment · Data Quality Management · Collaboration · Communication Skills

Required

Bachelor's degree in a scientific or analytic field and five years of relevant experience, or a combination of education and relevant experience

3+ years of experience in software development and data engineering with a strong focus on data cleaning, transformation, and creation

Proficiency in Python and experience with data processing libraries (e.g., Pandas, Polars, NumPy)

Hands-on experience in building and maintaining automated data pipelines for large-scale data processing

Familiarity with machine learning frameworks (e.g., PyTorch, JAX, scikit-learn) as applied to data quality and augmentation tasks

Expertise in working with healthcare data, including familiarity with the OMOP Common Data Model (OMOP CDM)

Strong experience in a Linux environment and comfort with UNIX command-line tools

Proven ability to work collaboratively in multidisciplinary teams and communicate technical concepts effectively

Knowledge of key data structures, algorithms, and techniques pertinent to systems that support high volume, velocity, or variety datasets (including data mining, machine learning, NLP, data retrieval)

Experience with relational, NoSQL, or NewSQL database systems and data modeling, structured and unstructured

Experience in parallel and distributed data processing techniques and platforms (MPI, MapReduce, batch)

Experience writing and debugging code in scripting languages, and experience with high-performance/systems languages and techniques

Knowledge of benchmark software development and programmable fields/systems, ability to analyze systems and data pipelines and propose solutions that leverage emerging technologies

Ability to use and integrate security controls for web applications, mobile platforms, and backend systems

Experience deploying reliable data systems and data quality management

Ability to research, evaluate, architect, and deploy new tools, frameworks, and patterns to build scalable Big Data platforms

Ability to document use cases, solutions and recommendations

Demonstrated excellence in written and verbal communication skills