ML Data Engineer – Healthcare Data Curation & Cleaning (1 Year Fixed Term) jobs in United States

Inside Higher Ed · 1 week ago

ML Data Engineer – Healthcare Data Curation & Cleaning (1 Year Fixed Term)

Stanford University is seeking an experienced ML Data Engineer to drive the programmatic curation, cleaning, and generation of healthcare data. The role focuses on developing and maintaining automated, ML-accelerated pipelines that ensure high-quality data ready for machine learning applications in a complex healthcare environment.

Digital Media · Education · Higher Education · Journalism · Recruiting

Responsibilities

Design Big Data systems that are scalable, optimized and fault-tolerant
Work closely with scientific staff, IT professionals, and project managers to understand their data requirements for existing and future projects involving Big Data
Develop, test, implement, and maintain database management applications. Optimize and tune the system, perform software review and maintenance to ensure that data design elements are reusable, repeatable and robust
Contribute to the development of guidelines, standards, and processes to ensure data quality, integrity and security of systems and data appropriate to risk
Participate in and/or contribute to setting strategy and standards through data architecture and implementation, leveraging Big Data, analytics tools and technologies
Work with IT and data owners to understand the types of data collected in various databases and data warehouses
Research and suggest new toolsets/methods to improve data ingestion, storage, and data access
Design, implement, and maintain robust pipelines for the programmatic cleaning, transformation, and curation of healthcare data
Develop automated processes to curate and validate data, ensuring accuracy and compliance with healthcare standards (e.g. OMOP CDM, FHIR)
Leverage core machine learning techniques to generate datasets, clean existing health records, join heterogeneous data sources, and enhance data quality for model training
Implement innovative solutions to detect and correct data inconsistencies and anomalies in large-scale healthcare datasets
Work extensively with patient-level health data, ensuring that data handling practices adhere to industry regulations and ethical standards
Utilize the OMOP Common Data Model (OMOP CDM) to standardize and harmonize disparate healthcare data sources, enhancing interoperability and scalability
Collaborate closely with data scientists, clinical informaticians, and engineers to align data engineering practices with analytical and clinical requirements
Continuously monitor, troubleshoot, and optimize data workflows to support dynamic research and operational needs

Qualifications

Python · Data Pipeline Engineering · Healthcare Data Expertise · Machine Learning Frameworks · Big Data Systems Design · Automated Data Pipelines · Linux Environment · Data Quality Management · Collaboration · Communication Skills

Required

Bachelor's degree in a scientific or analytic field and five years of relevant experience, or a combination of education and relevant experience
3+ years of experience in software development and data engineering with a strong focus on data cleaning, transformation, and creation
Proficiency in Python and experience with data processing libraries (e.g., Pandas, Polars, NumPy)
Hands-on experience in building and maintaining automated data pipelines for large-scale data processing
Familiarity with machine learning frameworks (e.g., PyTorch, JAX, scikit-learn) as applied to data quality and augmentation tasks
Expertise in working with healthcare data, including familiarity with the OMOP Common Data Model (OMOP CDM)
Strong experience in a Linux environment and comfort with UNIX command-line tools
Proven ability to work collaboratively in multidisciplinary teams and communicate technical concepts effectively
Knowledge of key data structures, algorithms, and techniques pertinent to systems that support high volume, velocity, or variety datasets (including data mining, machine learning, NLP, data retrieval)
Experience with relational, NoSQL, or NewSQL database systems and data modeling, structured and unstructured
Experience in parallel and distributed data processing techniques and platforms (MPI, Map/Reduce, Batch)
Experience writing and debugging code in scripting languages, and experience with high-performance/systems languages and techniques
Knowledge of benchmark software development and programmable fields/systems; ability to analyze systems and data pipelines and propose solutions that leverage emerging technologies
Ability to use and integrate security controls for web applications, mobile platforms, and backend systems
Experience deploying reliable data systems and data quality management
Ability to research, evaluate, architect, and deploy new tools, frameworks, and patterns to build scalable Big Data platforms
Ability to document use cases, solutions and recommendations
Demonstrated excellence in written and verbal communication skills

Preferred

Experience with cloud platforms (e.g., GCP, AWS, or Azure) and distributed computing frameworks
Proficiency with version control systems (e.g., Git) and containerization tools (e.g., Docker)
Familiarity with healthcare data standards and regulatory requirements

Benefits

Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position.
The Cardinal at Work website provides detailed information on Stanford’s extensive range of benefits and rewards offered to employees.
The University will provide reasonable accommodations to applicants and employees with disabilities.

Company

Inside Higher Ed

Inside Higher Ed is the online source for news, opinion, and jobs related to higher education.

Funding

Current Stage: Growth Stage
Total Funding: unknown
2022-01-10: Acquired
2006-08-31: Series Unknown

Leadership Team

Stephanie Shweiki
Director, Foundation Partnerships
Company data provided by crunchbase