Vision Language Model Engineer jobs in United States

EchoTwin AI · 2 months ago

Vision Language Model Engineer

EchoTwin AI is pioneering AI-driven infrastructure intelligence, redefining how cities are managed. As a Vision Language Model Engineer, you will design, develop, and optimize advanced vision-language models that integrate visual and textual data to enable intelligent systems, working closely with cross-functional teams to build models for applications like image captioning and visual question answering.

Artificial Intelligence (AI), Big Data, Computer Vision, Generative AI, Machine Learning, Smart Cities

Responsibilities

Design and implement state-of-the-art vision-language models using deep learning frameworks
Develop and fine-tune models that combine computer vision and natural language processing for tasks like image captioning, visual question answering, and text-to-image generation
Collaborate with data scientists and software engineers to integrate models into production systems
Optimize model performance for accuracy, latency, and scalability in real-world applications
Conduct experiments to evaluate model performance and iterate on architectures and training pipelines
Stay up to date with the latest research in vision-language models and incorporate advancements into projects
Contribute to data preprocessing, augmentation, and annotation pipelines for multimodal datasets
Document model development processes and present findings to technical and non-technical stakeholders

Qualifications

Vision-language models, Deep learning frameworks, Computer vision, Natural language processing, Python, Large-scale model training, Neural network architectures, Cloud platforms, Problem-solving skills, Communication skills

Required

Bachelor's, Master's, or Ph.D. in Computer Science, Machine Learning, Artificial Intelligence, or a related field (or equivalent experience)
3+ years of experience in machine learning, with a focus on vision-language models or multimodal AI
Hands-on experience with deep learning frameworks such as PyTorch or TensorFlow
Proven track record of building and deploying computer vision and/or NLP models
Proficiency in Python and relevant ML libraries (e.g., Hugging Face, OpenCV, Transformers)
Experience with large-scale model training and optimization (e.g., distributed training, quantization)
Strong understanding of neural network architectures (e.g., CNNs, Transformers, CLIP, or similar)
Experience with multimodal datasets and preprocessing techniques for images and text
Familiarity with cloud platforms (e.g., AWS, GCP, Azure) and model deployment workflows
Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
Excellent communication skills to explain complex technical concepts to diverse audiences

Benefits

Options for medical, dental, and vision coverage for employees and dependents (for US employees)
Flexible Spending Account (FSA) and Dependent Care Flexible Spending Account (DCFSA)
401(k) with 3% company matching
Unlimited PTO
Profit sharing

Company

EchoTwin AI

Transforming smart cities into cognitive cities that can see, think, & act.

Funding

Current Stage: Early Stage
Total Funding: $8M
Key Investors: Metis Ventures
2025-09-01 · Seed · $8M

Leadership Team

Chris Carson
Founder | Global CEO | Chairman of the Board
Michael Byrne
Chief Product Officer
Company data provided by Crunchbase