MiMo-VL-7B: A Powerful Vision-Language Model to Enhance General Visual Understanding and Multimodal Reasoning
Vision-language models (VLMs) are essential for multimodal AI systems, allowing autonomous agents to comprehend visual environments, reason across different content types, and interact with both digital and physical realms. The introduction of MiMo-VL-7B by researchers from Xiaomi marks a significant advancement in this field. This model consists of three main components: a native-resolution Vision Transformer encoder that captures fine-grained visual details, a Multi-Layer Perceptron projector for effective cross-modal alignment, and the MiMo-7B language model designed for complex reasoning tasks.
Training Methodology
MiMo-VL-7B is trained in two phases. The first phase is a four-stage pre-training process comprising:
- Projector warmup
- Vision-language alignment
- General multimodal pre-training
- Long-context supervised fine-tuning
This phase utilizes 2.4 trillion tokens from high-quality datasets, resulting in the MiMo-VL-7B-SFT model. The second phase involves post-training, which employs Mixed On-policy Reinforcement Learning (MORL) to integrate various reward signals, including perception accuracy, visual grounding precision, logical reasoning capabilities, and human preferences. This results in the MiMo-VL-7B-RL model.
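For readers who think in configuration terms, the curriculum above can be summarized roughly as in the sketch below. The stage names and data categories come from the article, while the per-stage data assignments and trainable-module choices shown here are illustrative assumptions rather than the published recipe.

```python
# Rough sketch of the four-stage pre-training curriculum described above.
# Which modules are unfrozen at each stage and the exact per-stage data mix
# are assumptions for illustration, not the released training recipe.
PRETRAINING_STAGES = [
    {"stage": "projector_warmup",
     "trainable": ["projector"],
     "data": ["image-caption pairs"]},
    {"stage": "vision_language_alignment",
     "trainable": ["vit", "projector"],
     "data": ["paired image-text data"]},
    {"stage": "general_multimodal_pretraining",
     "trainable": ["vit", "projector", "llm"],
     "data": ["captions", "OCR", "grounding", "video", "GUI", "reasoning", "text-only"]},
    {"stage": "long_context_sft",
     "trainable": ["vit", "projector", "llm"],
     "data": ["long documents", "long videos", "long reasoning traces"]},
]

TOTAL_PRETRAINING_TOKENS = 2.4e12  # ~2.4 trillion tokens across the whole phase
```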
Model Architecture
The MiMo-VL-7B architecture comprises:
- A Vision Transformer (ViT) for encoding visual inputs such as images and videos
- A projector that maps visual encodings into a latent space aligned with the language model
- The language model itself, which handles textual understanding and reasoning
The model employs Qwen2.5-ViT as its visual encoder to support native-resolution inputs, while MiMo-7B-Base serves as the language backbone for strong reasoning. The pre-training dataset spans diverse multimodal sources, including image captions, Optical Character Recognition (OCR) data, grounding data, video content, GUI interactions, reasoning examples, and text-only sequences.
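A minimal PyTorch-style sketch of how these three components compose is shown below; the module names, hidden sizes, and fusion logic are simplifying assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class VLMSketch(nn.Module):
    """ViT encoder -> MLP projector -> language model, as described above."""

    def __init__(self, vision_encoder, language_model, vision_dim=1280, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # native-resolution ViT (Qwen2.5-ViT in MiMo-VL-7B)
        self.projector = nn.Sequential(        # MLP projector aligning visual and text spaces
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model   # MiMo-7B-Base backbone

    def forward(self, pixel_values, text_embeddings):
        # Encode the image or video at native resolution into patch tokens.
        visual_tokens = self.vision_encoder(pixel_values)   # [num_patches, vision_dim]
        # Project visual tokens into the language model's embedding space.
        visual_tokens = self.projector(visual_tokens)       # [num_patches, llm_dim]
        # Prepend the projected visual tokens to the text embeddings and decode.
        fused = torch.cat([visual_tokens, text_embeddings], dim=0).unsqueeze(0)
        return self.language_model(inputs_embeds=fused)
```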
Post-Training Enhancements
The post-training phase further refines MiMo-VL-7B for challenging reasoning tasks and aligns it with human preferences through the MORL framework. This framework combines Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Human Feedback (RLHF). RLVR employs rule-based reward functions for continuous self-improvement, while RLHF addresses human preference alignment to reduce undesirable behaviors. MORL optimizes both RLVR and RLHF objectives simultaneously.
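As a rough illustration of how such a mixed objective can be wired together, the sketch below combines rule-based verifiable rewards with a learned preference score into a single scalar. The reward functions and weights are hypothetical placeholders; the paper's exact formulation is not reproduced here.

```python
def mixed_reward(sample, response, rule_based_rewards, preference_reward_model, weights=None):
    """Combine RLVR-style verifiable rewards with an RLHF preference score.

    `rule_based_rewards` maps signal names (e.g. perception accuracy, grounding
    precision, reasoning correctness) to callables; `preference_reward_model`
    scores human-preference alignment. All of these are hypothetical helpers.
    """
    scores = {name: fn(sample, response) for name, fn in rule_based_rewards.items()}
    scores["human_preference"] = preference_reward_model(sample, response)

    # Default to a uniform mix; in practice the weighting is a tuning decision.
    weights = weights or {name: 1.0 / len(scores) for name in scores}
    return sum(weights[name] * scores[name] for name in scores)
```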
Performance Evaluation
Comprehensive evaluations across 50 tasks demonstrate that MiMo-VL-7B achieves state-of-the-art performance among open-source models. Key findings include:
- MiMo-VL-7B-SFT and MiMo-VL-7B-RL score 64.6% and 66.7% on MMMU (val), respectively, outperforming larger models such as Gemma 3 27B.
- MiMo-VL-7B-RL excels in document understanding, reaching 56.5% on CharXiv-RQ and surpassing Qwen2.5-VL by 14.0 points and InternVL3 by 18.9 points.
- On multimodal reasoning tasks, MiMo-VL-7B-SFT delivers substantial improvements, outperforming much larger models such as Qwen2.5-VL-72B and QVQ-72B-Preview.
- The RL variant lifts MathVision accuracy from 57.9% to 60.4%.
MiMo-VL-7B also demonstrates exceptional GUI understanding and grounding capabilities, outperforming all compared general-purpose VLMs and matching GUI-specialized models on benchmarks such as ScreenSpot-Pro and OSWorld-G. In addition, the model holds the highest Elo rating among the evaluated open-source VLMs, ranking first across models spanning 7B to 72B parameters and closely approaching proprietary models.
Conclusion
The MiMo-VL-7B models achieve state-of-the-art performance through curated, high-quality pre-training datasets and the MORL framework. Key insights from this research include:
- Consistent performance gains from incorporating reasoning data in later pre-training stages
- The advantages of on-policy RL over vanilla GRPO (see the sketch after this list)
- Challenges of task interference when applying MORL across diverse capabilities
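The on-policy distinction noted above comes down to sampling fresh rollouts from the current policy at every update rather than reusing stale ones. The sketch below, built around a hypothetical `policy`, `reward_fn`, and optimizer, shows a group-relative update of that form; it illustrates the general idea, not the paper's training code.

```python
import torch

def on_policy_group_relative_step(policy, optimizer, prompts, reward_fn, group_size=8):
    """One update using fresh on-policy rollouts and group-relative advantages."""
    losses = []
    for prompt in prompts:
        # On-policy: responses are sampled from the *current* policy every step.
        responses = [policy.sample(prompt) for _ in range(group_size)]
        rewards = torch.tensor([reward_fn(prompt, r) for r in responses])

        # Group-relative advantage: normalize each reward within its own group.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

        # Simple policy-gradient surrogate weighted by the advantages.
        log_probs = torch.stack([policy.log_prob(prompt, r) for r in responses])
        losses.append(-(advantages * log_probs).mean())

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```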
The researchers have open-sourced a comprehensive evaluation suite to promote transparency and reproducibility in multimodal research, advancing capable open-source vision-language models and providing valuable insights for the community.
Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.