Inside Higher Ed · 1 week ago
ML Data Engineer – Healthcare Data Curation & Cleaning (1 Year Fixed Term)
Stanford University is seeking an experienced ML Data Engineer to drive the programmatic curation, cleaning, and generation of healthcare data. The role focuses on developing and maintaining automated, ML-accelerated pipelines that ensure high-quality data ready for machine learning applications in a complex healthcare environment.
Digital Media · Education · Higher Education · Journalism · Recruiting
Responsibilities
Design Big Data systems that are scalable, optimized and fault-tolerant
Work closely with scientific staff, IT professionals, and project managers to understand their data requirements for existing and future projects involving Big Data
Develop, test, implement, and maintain database management applications. Optimize and tune the system, and perform software review and maintenance to ensure that data design elements are reusable, repeatable, and robust
Contribute to the development of guidelines, standards, and processes to ensure data quality, integrity and security of systems and data appropriate to risk
Participate in and/or contribute to setting strategy and standards through data architecture and implementation, leveraging Big Data, analytics tools and technologies
Work with IT and data owners to understand the types of data collected in various databases and data warehouses
Research and suggest new toolsets/methods to improve data ingestion, storage, and data access
Design, implement, and maintain robust pipelines for the programmatic cleaning, transformation, and curation of healthcare data
Develop automated processes to curate and validate data, ensuring accuracy and compliance with healthcare standards (e.g., OMOP CDM, FHIR)
Leverage core machine learning techniques to generate datasets, clean existing health records, join heterogeneous data sources, and enhance data quality for model training
Implement innovative solutions to detect and correct data inconsistencies and anomalies in large-scale healthcare datasets
Work extensively with patient-level health data, ensuring that data handling practices adhere to industry regulations and ethical standards
Utilize the OMOP Common Data Model (OMOP CDM) to standardize and harmonize disparate healthcare data sources, enhancing interoperability and scalability
Collaborate closely with data scientists, clinical informaticians, and engineers to align data engineering practices with analytical and clinical requirements
Continuously monitor, troubleshoot, and optimize data workflows to support dynamic research and operational needs
Qualifications
Required
Bachelor's degree in a scientific or analytic field and five years of relevant experience, or a combination of education and relevant experience
3+ years of experience in software development and data engineering with a strong focus on data cleaning, transformation, and creation
Proficiency in Python and experience with data processing libraries (e.g., Pandas, Polars, NumPy)
Hands-on experience in building and maintaining automated data pipelines for large-scale data processing
Familiarity with machine learning frameworks (e.g., PyTorch, JAX, scikit-learn) as applied to data quality and augmentation tasks
Expertise in working with healthcare data, including familiarity with the OMOP Common Data Model (OMOP CDM)
Strong experience in a Linux environment and comfort with UNIX command-line tools
Proven ability to work collaboratively in multidisciplinary teams and communicate technical concepts effectively
Knowledge of key data structures, algorithms, and techniques pertinent to systems that support high-volume, high-velocity, or high-variety datasets (including data mining, machine learning, NLP, data retrieval)
Experience with relational, NoSQL, or NewSQL database systems and data modeling, structured and unstructured
Experience in parallel and distributed data processing techniques and platforms (MPI, Map/Reduce, Batch)
Experience writing and debugging code in scripting languages, and experience with high-performance/systems languages and techniques
Knowledge of benchmark software development and programmable fields/systems; ability to analyze systems and data pipelines and propose solutions that leverage emerging technologies
Ability to use and integrate security controls for web applications, mobile platforms, and backend systems
Experience deploying reliable data systems and data quality management
Ability to research, evaluate, architect, and deploy new tools, frameworks, and patterns to build scalable Big Data platforms
Ability to document use cases, solutions and recommendations
Demonstrated excellence in written and verbal communication skills
Preferred
Experience with cloud platforms (e.g., GCP, AWS, or Azure) and distributed computing frameworks
Proficiency with version control systems (e.g., Git) and containerization tools (e.g., Docker)
Familiarity with healthcare data standards and regulatory requirements
Benefits
Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position.
The Cardinal at Work website provides detailed information on Stanford’s extensive range of benefits and rewards offered to employees.
The University will provide reasonable accommodations to applicants and employees with disabilities.
Company
Inside Higher Ed
Inside Higher Ed is the online source for news, opinion, and jobs related to higher education.
Funding
Current Stage: Growth Stage
Total Funding: unknown
2022-01-10: Acquired
2006-08-31: Series Unknown
Company data provided by Crunchbase