EchoTwin AI · 4 months ago

Vision Language Model Engineer

San Francisco Office

Full-time

Onsite

Mid Level

3+ years exp

EchoTwin AI is pioneering AI-driven infrastructure intelligence, redefining how cities are managed. As a Vision Language Model Engineer, you will design, develop, and optimize advanced vision-language models that integrate visual and textual data to enable intelligent systems. You will collaborate with cross-functional teams to build applications such as image captioning and visual question answering.

Artificial Intelligence (AI) · Big Data · Computer Vision · Generative AI · Machine Learning · Smart Cities

Responsibilities

Design and implement state-of-the-art vision-language models using deep learning frameworks

Develop and fine-tune models that combine computer vision and natural language processing for tasks like image captioning, visual question answering, and text-to-image generation

Collaborate with data scientists and software engineers to integrate models into production systems

Optimize model performance for accuracy, latency, and scalability in real-world applications

Conduct experiments to evaluate model performance and iterate on architectures and training pipelines

Stay up to date with the latest research in vision-language models and incorporate advancements into projects

Contribute to data preprocessing, augmentation, and annotation pipelines for multimodal datasets

Document model development processes and present findings to technical and non-technical stakeholders

Qualifications

Vision-language models · Deep learning frameworks · Computer vision · Natural language processing · Python · Large-scale model training · Neural network architectures · Cloud platforms · Problem-solving skills · Communication skills

Required

Bachelor's, Master's, or Ph.D. in Computer Science, Machine Learning, Artificial Intelligence, or a related field (or equivalent experience)

3+ years of experience in machine learning, with a focus on vision-language models or multimodal AI

Hands-on experience with deep learning frameworks such as PyTorch or TensorFlow

Proven track record of building and deploying computer vision and/or NLP models

Proficiency in Python and relevant ML libraries (e.g., Hugging Face Transformers, OpenCV)

Experience with large-scale model training and optimization (e.g., distributed training, quantization)

Strong understanding of neural network architectures (e.g., CNNs, Transformers, CLIP)

Experience with multimodal datasets and preprocessing techniques for images and text

Familiarity with cloud platforms (e.g., AWS, GCP, Azure) and model deployment workflows

Strong problem-solving skills and ability to work in a fast-paced, collaborative environment

Excellent communication skills to explain complex technical concepts to diverse audiences

Benefits

Options for medical, dental, and vision coverage for employees and dependents (for US employees)

Flexible Spending Account (FSA) and Dependent Care Flexible Spending Account (DCFSA)

401(k) with 3% company matching

Unlimited PTO

Profit sharing

Company

EchoTwin AI

Transforming smart cities into cognitive cities that can see, think, & act.

Founded in 2024

Dubai, United Arab Emirates

11-50 employees

http://www.echotwin.ai

Funding

Current Stage

Early Stage

Total Funding

$8M

Key Investors

Metis Ventures

2025-09-01 · Seed · $8M

Leadership Team

Chris Carson

Founder | Global CEO | Chairman of the Board

Michael Byrne

Chief Product Officer

Recent News

Sun Sentinel

Boca Raton sees growth with new companies as office campus draws even more tenants

2025-10-25

The Real Deal

Lease roundup: CP Group’s Boca Raton office campus scores over 260K sf in deals

2025-10-20

Business Wire

CP Group Announces Over 266,000 SF in Leasing Activity at Boca Raton Innovation Campus (BRiC)

2025-10-17

Company data provided by Crunchbase