Salesforce AI Releases BLIP3-o: A Fully Open-Source Unified Multimodal Model Built with CLIP Embeddings and Flow Matching for Image Understanding and Generation

Introduction to Multimodal Modeling

Multimodal modeling aims to create systems that can understand and generate content across visual and textual formats. These models interpret visual scenes and produce new images from natural language prompts. Combining image understanding and image generation in a single system lets one model handle both directions coherently, rather than relying on separate specialist models for each task.

Challenges in Multimodal Systems

A significant challenge in this field is developing architectures that can handle both understanding and generation without compromising quality. Models must grasp complex visual concepts and produce high-quality images that align with user prompts. This requires a careful balance between semantic understanding and pixel-level synthesis.

Previous Approaches

Traditionally, models have relied on Variational Autoencoders (VAEs) or CLIP-based encoders for image representation. While VAEs are efficient for reconstruction, they often yield less informative representations. In contrast, CLIP-based encoders provide high-level semantic embeddings but are not optimized for pixel reconstruction, which complicates their use for generation without an additional model such as a diffusion decoder. Training objectives like Mean Squared Error (MSE) are common but push the model toward a single deterministic output per prompt. To improve diversity and quality in generation, researchers have explored Flow Matching, which introduces controlled stochasticity and is better suited to modeling continuous image features.
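To make the contrast with a plain MSE objective concrete, the sketch below shows a minimal (rectified) flow-matching training step over continuous feature vectors. It is an illustrative example under assumed interfaces, not BLIP3-o's actual training code: `model`, `x1` (target image features), and `cond` (conditioning features) are hypothetical placeholders.

```python
# Minimal flow-matching training step over feature vectors (illustrative sketch).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: target image features (B, N, D); cond: conditioning features."""
    x0 = torch.randn_like(x1)                      # sample a noise starting point
    t = torch.rand(x1.size(0), device=x1.device)   # random time in [0, 1] per example
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over feature dims
    x_t = (1 - t_) * x0 + t_ * x1                  # point on the straight path noise -> data
    v_target = x1 - x0                             # constant velocity along that path
    v_pred = model(x_t, t, cond)                   # network predicts the velocity field
    return F.mse_loss(v_pred, v_target)
```

Unlike direct MSE regression onto a fixed target, each training example sees a different noise sample and timestep, so the model learns a distribution over plausible features rather than a single point estimate.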

Introducing BLIP3-o

Researchers from Salesforce Research, in collaboration with the University of Maryland and other academic institutions, have introduced BLIP3-o, a family of unified multimodal models. The model employs a dual-stage training strategy, first focusing on image understanding and then on image generation. It leverages CLIP embeddings for image representation and integrates them with a diffusion transformer for synthesizing new visual outputs. This sequential approach preserves the strengths of each task independently, avoiding interference.

Technical Specifications

The diffusion module is trained while the autoregressive backbone is kept frozen, which improves alignment and visual fidelity. The team also curated BLIP3o-60k, a high-quality instruction-tuning dataset generated by prompting GPT-4o across various visual categories, including scenes, objects, gestures, and text. Two model versions were developed: an 8-billion-parameter model trained on a mix of proprietary and public data, and a 4-billion-parameter version trained only on open-source data.
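The sketch below illustrates what the second training stage looks like in practice: the multimodal backbone is frozen and only the diffusion transformer's parameters are optimized. The module names (`backbone`, `diffusion_head`) are placeholders, not BLIP3-o's actual class names.

```python
# Illustrative second-stage setup: freeze the autoregressive backbone,
# train only the diffusion transformer that generates image features.
import torch

def build_stage2_optimizer(backbone: torch.nn.Module,
                           diffusion_head: torch.nn.Module,
                           lr: float = 1e-4) -> torch.optim.Optimizer:
    for p in backbone.parameters():
        p.requires_grad_(False)        # keep the multimodal LLM fixed
    backbone.eval()                    # no dropout updates in the frozen part
    return torch.optim.AdamW(          # optimize only the image-generation head
        diffusion_head.parameters(), lr=lr
    )
```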

Image Generation Pipeline

The image generation pipeline of BLIP3-o is built on Qwen2.5-VL large language models. Prompts are processed into visual features, which are then refined by a Flow Matching diffusion transformer based on the Lumina-Next architecture. Each image is encoded into 64 fixed-length semantic vectors regardless of resolution, which keeps storage compact and decoding efficient. Training used a large-scale dataset of 25 million images from sources such as CC12M, SA-1B, and JourneyDB, supplemented with 30 million proprietary samples for the 8B model. In addition, 60k instruction-tuning samples generated via GPT-4o were included to cover complex prompts.
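A high-level sketch of this generation path is shown below, under assumed interfaces: an LLM callable that emits conditioning features from the prompt, a flow-matching transformer that produces 64 CLIP-like semantic vectors, and a diffusion decoder that maps those vectors to pixels. All three callables and the feature dimension are hypothetical stand-ins for the released components.

```python
# Illustrative inference path: prompt -> conditioning -> 64 semantic vectors -> image.
import torch

@torch.no_grad()
def generate_image(prompt: str, llm, flow_transformer, diffusion_decoder,
                   num_tokens: int = 64, feat_dim: int = 1024, steps: int = 50):
    cond = llm(prompt)                              # prompt -> conditioning features
    x = torch.randn(1, num_tokens, feat_dim)        # start from noise in the semantic space
    for i in range(steps):                          # simple Euler integration of the flow
        t = torch.full((1,), i / steps)
        v = flow_transformer(x, t, cond)            # predicted velocity at time t
        x = x + v / steps
    return diffusion_decoder(x)                     # semantic vectors -> pixels
```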

Performance Metrics

BLIP3-o has demonstrated top scores across multiple benchmarks. The 8B model achieved a GenEval score of 0.84 for image generation alignment and a WISE score of 0.62 for reasoning ability. In image understanding, it scored 1682.6 on MME-Perception, 647.1 on MME-Cognition, 50.6 on MMMU, and 83.1 on both VQAv2 and TextVQA datasets. A human evaluation comparing BLIP3-o 8B with Janus Pro 7B indicated that BLIP3-o was preferred 50.4% of the time for visual quality and 51.5% for prompt alignment, supported by statistically significant p-values (5.05e-06 and 1.16e-05).

Conclusion

This research presents a clear solution to the dual challenge of image understanding and generation. The integration of CLIP embeddings, Flow Matching, and a sequential training strategy illustrates a methodical approach to multimodal modeling. The BLIP3-o model not only delivers state-of-the-art results but also introduces an efficient and open approach to unified multimodal systems.

Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.