MeshyAI · 2 months ago
Data Infrastructure Engineer
Meshy is a leading 3D generative AI company headquartered in Silicon Valley, focused on transforming the content creation pipeline. The Data Infrastructure Engineer will design, build, and operate distributed data systems for large-scale data ingestion and processing, ensuring data quality and scalability while collaborating with ML researchers.
Computer Software
Responsibilities
Design, implement, and maintain distributed ingestion pipelines for structured and unstructured data (images, 3D/2D assets, binaries)
Build scalable ETL/ELT workflows to transform, validate, and enrich datasets for AI/ML model training and analytics (see the sketch after this list)
Architect pipelines across cloud object storage (S3, GCS, Azure Blob), data lakes, and metadata catalogs
Optimize large-scale processing with distributed frameworks (Spark, Dask, Ray, Flink, or equivalents)
Implement partitioning, sharding, caching strategies, and observability (monitoring, logging, alerting) for reliable pipelines
Support preprocessing of unstructured assets (e.g., images, 3D/2D models, video) for training pipelines, including format conversion, normalization, augmentation, and metadata extraction
Implement validation and quality checks to ensure datasets meet ML training requirements
Collaborate with ML researchers to quickly adapt pipelines to evolving pretraining and evaluation needs
Use infrastructure-as-code (Terraform, Kubernetes, etc.) to manage scalable and reproducible environments
Integrate CI/CD best practices for data workflows
Maintain data lineage, reproducibility, and governance for datasets used in AI/ML pipelines
Work cross-functionally with ML researchers, graphics/vision engineers, and platform teams
Embrace versatility: switch between infrastructure-level challenges and asset/data-level problem solving
Contribute to a culture of fast iteration, pragmatic trade-offs, and collaborative ownership
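As a loose illustration of the ETL/ELT and validation work described above (not Meshy's actual stack), here is a minimal PySpark sketch that ingests raw asset metadata, applies a basic quality filter, and writes date-partitioned Parquet to object storage. Bucket paths, column names, and the validation rule are hypothetical.

```python
# Minimal sketch only: ingest raw asset metadata, validate, and write
# partitioned Parquet to object storage. All paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("asset-metadata-ingest").getOrCreate()

# Read newline-delimited JSON metadata dropped by upstream crawlers (assumed layout).
raw = spark.read.json("s3://example-raw-bucket/asset-metadata/*.jsonl")

# Basic quality check: keep only records with an asset id and a known format.
valid = raw.filter(
    F.col("asset_id").isNotNull()
    & F.col("format").isin("glb", "obj", "png", "jpg")
)

# Partition by ingestion date so downstream training jobs can prune reads.
(valid
    .withColumn("ingest_date", F.current_date())
    .write
    .mode("append")
    .partitionBy("ingest_date")
    .parquet("s3://example-curated-bucket/asset-metadata/"))
```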
Qualifications
Required
5+ years of experience in data engineering, distributed systems, or similar
Strong programming skills in Python (Scala/Java/C++ a plus)
Solid skills in SQL for analytics, transformations, and warehouse/lakehouse integration
Proficiency with distributed frameworks (Spark, Dask, Ray, Flink)
Familiarity with cloud platforms (AWS/GCP/Azure) and storage systems (S3, Parquet, Delta Lake, etc.)
Experience with workflow orchestration tools (Airflow, Prefect, Dagster); a minimal DAG sketch follows this list
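For the orchestration tools mentioned above, a minimal Airflow-style sketch (hypothetical task names and schedule, placeholder task bodies) showing how a daily extract, validate, and load batch might be wired:

```python
# Minimal sketch only: an Airflow DAG wiring extract -> validate -> load for a
# daily metadata batch. Task bodies are placeholders; names and schedule are
# hypothetical, shown only to indicate the orchestration style.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def asset_metadata_pipeline():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull new asset records from an upstream source.
        return [{"asset_id": "a1", "format": "glb"}]

    @task
    def validate(records: list[dict]) -> list[dict]:
        # Placeholder: drop records missing required fields.
        return [r for r in records if r.get("asset_id") and r.get("format")]

    @task
    def load(records: list[dict]) -> None:
        # Placeholder: write curated records to the lakehouse table.
        print(f"loading {len(records)} records")

    load(validate(extract()))

asset_metadata_pipeline()
```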
Preferred
Experience handling large-scale unstructured datasets (images, video, binaries, or 3D/2D assets)
Familiarity with AI/ML training data pipelines, including dataset versioning, augmentation, and sharding (see the sketch after this list)
Exposure to computer graphics or 3D/2D data processing (strongly preferred)
Experience with Kubernetes for distributed workloads and orchestration
Experience with data warehouses or lakehouse platforms (Snowflake, BigQuery, Databricks, Redshift)
Familiarity with GPU-accelerated computing and HPC clusters
Experience with 3D/2D asset processing (geometry transformations, rendering pipelines, texture handling)
Familiarity with rendering engines (Blender, Unity, Unreal) for synthetic data generation
Open-source contributions in ML infrastructure, distributed systems, or data platforms
Familiarity with secure data handling and compliance
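As a rough illustration of the dataset sharding mentioned in this list (not a prescribed approach), here is a minimal sketch that packs assets into fixed-size tar shards, a common layout for streaming training data. Paths and shard size are hypothetical, and a real pipeline would also emit a versioned manifest.

```python
# Minimal sketch only: pack unstructured assets into fixed-size tar shards.
# Directory names and shard size are hypothetical.
import tarfile
from pathlib import Path

SHARD_SIZE = 1000                      # assets per shard (assumption)
src_dir = Path("raw_assets")           # hypothetical input directory
out_dir = Path("shards")
out_dir.mkdir(exist_ok=True)

assets = sorted(src_dir.glob("*.png"))
for shard_start in range(0, len(assets), SHARD_SIZE):
    shard_path = out_dir / f"assets-{shard_start // SHARD_SIZE:05d}.tar"
    with tarfile.open(shard_path, "w") as tar:
        for asset in assets[shard_start:shard_start + SHARD_SIZE]:
            # Keep each member's filename so loaders can key samples by stem.
            tar.add(asset, arcname=asset.name)
```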
Benefits
Competitive salary, equity, and benefits package.
Flexible work environment, with options for remote and on-site work.
Opportunities for fast professional growth and development.
An inclusive culture that values creativity, innovation, and collaboration.
Unlimited, flexible time off.
401(k) plan for employees.
Comprehensive health, dental, and vision insurance.
The latest and best office equipment.