EchoTwin AI · 4 months ago
Vision Language Model Engineer
EchoTwin AI is pioneering AI-driven infrastructure intelligence, redefining how cities are managed. As a Vision Language Model Engineer, you will design, develop, and optimize advanced vision-language models that integrate visual and textual data to enable intelligent systems, collaborating with cross-functional teams to build applications such as image captioning and visual question answering.
Artificial Intelligence (AI) · Big Data · Computer Vision · Generative AI · Machine Learning · Smart Cities
Responsibilities
Design and implement state-of-the-art vision-language models using deep learning frameworks
Develop and fine-tune models that combine computer vision and natural language processing for tasks like image captioning, visual question answering, and text-to-image generation
Collaborate with data scientists and software engineers to integrate models into production systems
Optimize model performance for accuracy, latency, and scalability in real-world applications
Conduct experiments to evaluate model performance and iterate on architectures and training pipelines
Stay up to date with the latest research in vision-language models and incorporate advancements into projects
Contribute to data preprocessing, augmentation, and annotation pipelines for multimodal datasets
Document model development processes and present findings to technical and non-technical stakeholders
Qualifications
Required
Bachelor's, Master's, or Ph.D. in Computer Science, Machine Learning, Artificial Intelligence, or a related field (or equivalent experience)
3+ years of experience in machine learning, with a focus on vision-language models or multimodal AI
Hands-on experience with deep learning frameworks such as PyTorch or TensorFlow
Proven track record of building and deploying computer vision and/or NLP models
Proficiency in Python and relevant ML libraries (e.g., Hugging Face Transformers, OpenCV)
Experience with large-scale model training and optimization (e.g., distributed training, quantization)
Strong understanding of neural network architectures (e.g., CNNs, Transformers, CLIP)
Experience with multimodal datasets and preprocessing techniques for images and text
Familiarity with cloud platforms (e.g., AWS, GCP, Azure) and model deployment workflows
Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
Excellent communication skills to explain complex technical concepts to diverse audiences
Benefits
Options for medical, dental, and vision coverage for employees and dependents (for US employees)
Flexible Spending Account (FSA) and Dependent Care Flexible Spending Account (DCFSA)
401(k) with 3% company matching
Unlimited PTO
Profit sharing
Company
EchoTwin AI
Transforming smart cities into cognitive cities that can see, think, & act.
Funding
Current Stage: Early Stage
Total Funding: $8M
Key Investors: Metis Ventures
2025-09-01 · Seed · $8M
Company data provided by Crunchbase