Abaka AI · 2 weeks ago
Data Engineer (Web Data)
Abaka AI is built on one mission: to be the world’s most trusted data partner for AI companies. They are seeking a Data Engineer (Web Data) focused on Web Crawling to design, build, and maintain robust crawling infrastructure that supports large-scale data collection for multimodal AI systems.
Data Collection and Labeling · Machine Learning · Natural Language Processing
Responsibilities
Collaborate closely with clients to understand their data requirements, and coordinate internal teams to create tailored delivery plans that ensure on-time, high-quality data delivery, including meeting expectations for format, precision, and volume
Lead the development of mid- to long-term plans for the data engineering function. Build scalable, end-to-end pipelines for multimodal data (text, image, audio, video, 3D point cloud, etc.), covering data sourcing, cleaning, annotation, QA, storage, and iterative optimization for training, fine-tuning, and evaluation
Develop solutions to core technical challenges in multimodal data processing, such as cross-modal alignment (for example, image-text semantic matching), large-scale data cleaning (deduplication, denoising, format normalization), annotation efficiency, and data encryption and security
Work cross-functionally with algorithm, product, and business teams by providing feedback to model teams on data bottlenecks, helping refine internal tools and services, and supporting client-facing teams with technical documentation and pre-sales materials
Evaluate and optimize the cost structure of data processing operations, including headcount, infrastructure, and tooling, to balance quality, efficiency, and scalability
Qualifications
Required
Strong background in computer science, data engineering, artificial intelligence, or related fields, with hands-on experience working with large-scale data systems
3+ years of experience in data engineering or data operations. Leadership experience is highly valued, and prior involvement in LLM or multimodal dataset preparation is a strong plus
Must-have technical skills: Strong Python proficiency; HTML/DOM parsing (lxml, XPath); HTTP internals; advanced Scrapy; async crawling (aiohttp/asyncio); Playwright/Selenium; familiarity with browser internals
Deep understanding of end-to-end multimodal data workflows, with practical experience in at least two modalities, such as text, images, audio, or video
Proficiency in designing technical architectures for large-scale data pipelines, including distributed processing and automation frameworks. Familiarity with data privacy and security best practices such as access control and data anonymization
Strong execution and team management skills, with the ability to translate high-level objectives into actionable plans and drive team results
Excellent communication and cross-functional collaboration skills, with the ability to clearly communicate technical and operational requirements, resolve conflicts, and manage stakeholder expectations
High sense of ownership and resilience, with comfort operating in a fast-paced, evolving AI environment and the ability to navigate urgent delivery timelines
Benefits
Equity
Comprehensive benefits package
Health
Dental
Vision
PTO
Flexible work schedule
Company
Abaka AI
Abaka AI is a leading AI company committed to becoming the data partner for the artificial intelligence industry.
H1B Sponsorship
Abaka AI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data provided by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown; legend marks job fields similar to this job)
Trends of Total Sponsorships: 2025 (2)
Funding
Current Stage
Growth Stage
Company data provided by Crunchbase