
VLM2Vec-V2: A Unified Computer Vision Framework for Multimodal Embedding Learning Across Images, Videos, and Visual Documents

Understanding the Target Audience

The target audience for VLM2Vec-V2 primarily includes researchers, data scientists, and business professionals in the fields of artificial intelligence and computer vision. These individuals are typically engaged in developing or implementing AI solutions that require advanced multimodal embedding techniques.

Pain Points

  • Limited performance of existing models on diverse visual data types.
  • Challenges in integrating various data modalities for comprehensive analysis.
  • Need for scalable solutions that can handle large datasets effectively.

Goals

  • To enhance the accuracy and efficiency of multimodal data retrieval.
  • To unify different types of visual data processing within a single framework.
  • To leverage advanced embedding models for practical applications in business and research.

Interests

  • Latest advancements in AI and machine learning technologies.
  • Innovative applications of computer vision in various industries.
  • Research findings that can inform better decision-making in AI projects.

Communication Preferences

The audience prefers clear, concise, and technical communication that includes data-driven insights and practical examples. They value peer-reviewed research and detailed specifications that can be directly applied to their work.

Overview of VLM2Vec-V2

Embedding models serve as bridges between data modalities by encoding diverse multimodal information into a shared dense representation space. Recent advancements in embedding models have been propelled by progress in large foundation models. However, existing multimodal embedding models have primarily focused on benchmarks such as MMEB and M-BEIR, which mainly include natural images and photographs from sources such as MSCOCO, Flickr, and ImageNet. This narrow focus leaves them underperforming on realistic tasks such as article search, website search, and YouTube video search.
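To make the shared representation space concrete, here is a minimal, self-contained sketch (not the authors' code): the hypothetical encode_text and encode_visual functions stand in for a multimodal embedder such as VLM2Vec-V2, and candidates from any visual modality are ranked against a text query by cosine similarity.

```python
import torch
import torch.nn.functional as F

# Placeholder encoders standing in for a multimodal embedding model; any model
# that maps text, images, video clips, or document pages into the same
# d-dimensional space fits this retrieval pattern.
def encode_text(texts):
    return torch.randn(len(texts), 512)      # (n, d) dummy embeddings

def encode_visual(items):
    return torch.randn(len(items), 512)      # (m, d) dummy embeddings

# Cross-modal retrieval: embed query and candidates, L2-normalize, and rank
# candidates by cosine similarity in the shared space.
query_emb = F.normalize(encode_text(["how to fine-tune a vision-language model"]), dim=-1)
cand_emb = F.normalize(encode_visual(["image_0", "video_0", "doc_page_0"]), dim=-1)
scores = query_emb @ cand_emb.T                        # (1, m) cosine similarities
ranking = scores.argsort(dim=-1, descending=True)      # best candidate first
print(scores, ranking)
```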

Multimodal embedding benchmarks like MSCOCO, Flickr30K, and Conceptual Captions initially concentrated on static image-text pairs for tasks like image captioning and retrieval. More recent benchmarks, such as M-BEIR and MMEB, introduced multi-task evaluations but remain limited to static images and short contexts. Video representation learning has progressed through models like VideoCLIP and VideoCoCa, which integrate contrastive learning with captioning objectives. Visual document representation learning has advanced through models like ColPali and VisRAG, which utilize VLMs for document retrieval. However, none of these models unify image, video, and visual document retrieval within a single framework.

Researchers from Salesforce Research, UC Santa Barbara, University of Waterloo, and Tsinghua University have proposed VLM2Vec-V2 to address this gap. The model aims to unify image, video, and visual document retrieval within a single framework. Key developments include:

  • Creation of MMEB-V2, a benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering.
  • Development of VLM2Vec-V2 as a general-purpose embedding model that supports multiple input modalities while demonstrating strong performance on both newly introduced tasks and original image benchmarks.

Technical Specifications

VLM2Vec-V2 uses Qwen2-VL as its backbone, selected for its capabilities in multimodal processing. Qwen2-VL offers three features that support unified embedding learning (a minimal embedding-extraction sketch follows the list):

  • Naive Dynamic Resolution
  • Multimodal Rotary Position Embedding (M-RoPE)
  • A unified framework that combines 2D and 3D convolutions to process both images and videos
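
As a rough illustration of how an embedding can be pooled from a Qwen2-VL-style backbone, the sketch below loads the base Qwen2-VL-2B-Instruct checkpoint through Hugging Face transformers and takes the final token's hidden state as the embedding, a common pooling choice in VLM2Vec-style models. The checkpoint name, prompt, and pooling choice are illustrative assumptions rather than the released VLM2Vec-V2 recipe.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Base Qwen2-VL checkpoint used purely for illustration; the released
# VLM2Vec-V2 weights would be substituted here for retrieval-quality embeddings.
model_id = "Qwen/Qwen2-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

# Build an image+instruction input via the chat template, which inserts the
# vision placeholder tokens expected by Qwen2-VL's dynamic-resolution encoder.
image = Image.new("RGB", (224, 224), color="white")     # stand-in image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Represent this image for retrieval."}]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt")

# VLM2Vec-style pooling: run the decoder, take the hidden state of the final
# token, and L2-normalize so embeddings can be compared by cosine similarity.
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, return_dict=True)
embedding = F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)
print(embedding.shape)   # (1, hidden_size)
```

The resulting vector can be scored against query embeddings exactly as in the retrieval sketch above; document page screenshots follow the same path, and videos are handled analogously via the processor's video inputs.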

To enable effective multi-task training across diverse data sources, VLM2Vec-V2 introduces a flexible data sampling pipeline with two key components (sketched in code after the list):

  • On-the-fly batch mixing based on predefined sampling weight tables that control the relative probabilities of each dataset.
  • An interleaved sub-batching strategy that splits full batches into independently sampled sub-batches, improving the stability of contrastive learning.
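
The sketch below illustrates that sampling logic under stated assumptions: the weight table, dataset names, and batch sizes are hypothetical placeholders, and the real pipeline draws query/target training pairs from the MMEB-V2 sources.

```python
import random

# Hypothetical sampling-weight table; the actual training recipe defines its
# own per-dataset weights.
SAMPLING_WEIGHTS = {"mmeb_image": 0.5, "video_retrieval": 0.3, "visual_docs": 0.2}

def sample_sub_batch(datasets, weights, sub_batch_size):
    """On-the-fly batch mixing: pick one source according to the weight table,
    then draw a sub-batch of (query, target) pairs from it."""
    name = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    return [random.choice(datasets[name]) for _ in range(sub_batch_size)]

def build_batch(datasets, weights, batch_size, num_sub_batches):
    """Interleaved sub-batching: a full batch is the concatenation of several
    independently sampled sub-batches, which the authors report improves the
    stability of contrastive learning."""
    sub_size = batch_size // num_sub_batches
    batch = []
    for _ in range(num_sub_batches):
        batch.extend(sample_sub_batch(datasets, weights, sub_size))
    return batch

# Toy usage with placeholder (query, target) pairs per source.
datasets = {name: [(f"{name}-query-{i}", f"{name}-target-{i}") for i in range(100)]
            for name in SAMPLING_WEIGHTS}
batch = build_batch(datasets, SAMPLING_WEIGHTS, batch_size=8, num_sub_batches=2)
print(batch)
```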

Performance Evaluation

VLM2Vec-V2 achieves the highest overall average score of 58.0 across 78 datasets covering image, video, and visual document tasks, outperforming strong baselines including GME, LamRA, and VLM2Vec built on the same Qwen2-VL backbone. On image tasks, VLM2Vec-V2 significantly outperforms most baselines and achieves performance comparable to VLM2Vec-7B despite being only 2B parameters in size. For video tasks, the model achieves competitive performance despite training on relatively small amounts of video data. In visual document retrieval, VLM2Vec-V2 outperforms all VLM2Vec variants but still lags behind ColPali, which is specifically optimized for visual document tasks.

Conclusion

VLM2Vec-V2 represents a strong baseline model trained through contrastive learning across diverse tasks and modality combinations. Built upon MMEB-V2 and utilizing Qwen2-VL as its backbone model, VLM2Vec-V2 establishes a foundation for more scalable and flexible representation learning in both research and practical applications. The experimental evaluation demonstrates its effectiveness in achieving balanced performance across multiple modalities while highlighting the diagnostic value of MMEB-V2 for future research.

Further Reading

For more information, check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.