Alibaba AI Team Just Released Ovis 2.5 Multimodal LLMs: A Major Leap in Open-Source AI with Enhanced Visual Perception and Reasoning Capabilities
Understanding the Target Audience
The primary audience for Ovis 2.5 includes AI researchers, data scientists, and business managers interested in leveraging advanced AI technologies. Their pain points often involve:
- Challenges in processing high-detail visual information.
- Limitations of existing models in complex reasoning tasks.
- Resource constraints for deploying AI solutions on mobile and edge devices.
Their goals include enhancing productivity through improved AI capabilities and staying competitive in a rapidly evolving technological landscape. Interests often revolve around open-source solutions, technical advancements, and practical applications in various domains. Communication preferences lean towards detailed technical documentation, peer-reviewed studies, and engaging community discussions on platforms like Reddit and GitHub.
Overview of Ovis 2.5
Ovis 2.5, the latest multimodal large language model (MLLM) from Alibaba’s AIDC-AI team, is released in 9B and 2B parameter variants. It introduces major enhancements in:
- Native-resolution vision perception
- Deep multimodal reasoning
- Robust Optical Character Recognition (OCR)
These advancements address long-standing limitations faced by MLLMs, especially in processing intricate visual data and performing complex reasoning tasks.
Native-Resolution Vision and Deep Reasoning
A key innovation in Ovis 2.5 is its native-resolution vision transformer (NaViT). Rather than resizing every image to a fixed square, the model processes images at their original, variable resolutions, preserving the integrity of detailed visuals (a conceptual sketch of this patch-and-pack idea follows the list below). This upgrade enhances performance on tasks involving:
- Scientific diagrams
- Complex infographics
- Detailed forms
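The snippet below is a minimal, illustrative sketch of the NaViT-style approach of turning a native-resolution image into a variable-length patch sequence instead of resizing it to a fixed square. The patch size and tensor shapes are assumptions chosen for illustration and do not reflect Ovis 2.5’s actual implementation.

```python
# Illustrative sketch (not the official Ovis code) of NaViT-style "patch and pack":
# images keep their native resolutions and are split into fixed-size patches,
# producing sequences whose length tracks each image's true size.
import torch

PATCH = 14  # assumed patch size, for illustration only

def patchify(image: torch.Tensor) -> torch.Tensor:
    """Split a (C, H, W) image into a (num_patches, C * PATCH * PATCH) sequence.
    H and W are only cropped to multiples of PATCH, never resized to a fixed
    square, so aspect ratio and fine detail are preserved."""
    c, h, w = image.shape
    h, w = h - h % PATCH, w - w % PATCH
    patches = image[:, :h, :w].unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)

# Two images at different native resolutions yield different sequence lengths.
seq_a = patchify(torch.rand(3, 336, 518))   # e.g. a wide infographic
seq_b = patchify(torch.rand(3, 896, 672))   # e.g. a dense scientific diagram
print(seq_a.shape, seq_b.shape)             # (888, 588) and (3072, 588) patch tokens
```

Because each image contributes a sequence proportional to its true resolution, fine print in forms and dense diagrams is not blurred away by downscaling.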
To improve reasoning capabilities, Ovis 2.5 employs a curriculum that incorporates “thinking-style” samples for self-correction and reflection. Users can activate an optional “thinking mode” during inference to enhance accuracy on tasks requiring deep multimodal analysis, such as scientific question answering or mathematical problem solving.
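As a hedged sketch of how the optional thinking mode might be invoked at inference time, the example below loads the model via Hugging Face transformers. The repository id, the chat-style method, and the `enable_thinking` flag are assumptions based on the description above, not a verified API; the Hugging Face model card is the authoritative reference for the exact calling convention.

```python
# Hedged sketch of running Ovis 2.5 with the optional "thinking mode".
# The repo id and the chat/enable_thinking interface are assumptions drawn from
# the article's description; check the model card for the real API.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2.5-9B",           # assumed repository id
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,          # Ovis ships custom modeling code
).cuda().eval()

image = Image.open("diagram.png")    # e.g. a scientific diagram or chart
question = "What trend does this chart show between 2015 and 2020?"

# Hypothetical chat-style call: `enable_thinking` would switch on the
# self-reflective reasoning trace before the final answer.
response = model.chat(
    prompt=question,
    images=[image],
    enable_thinking=True,            # assumed flag name
    max_new_tokens=1024,
)
print(response)
```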
Performance Benchmarks and Results
Ovis 2.5-9B achieves an average score of 78.3 on the OpenCompass multimodal leaderboard, outperforming all open-source MLLMs under 40B parameters. Ovis 2.5-2B scores 73.9, setting a new standard for lightweight models suited to on-device or resource-constrained inference. Both variants excel in:
- STEM reasoning (MathVista, MMMU, WeMath)
- OCR and chart analysis (OCRBench v2, ChartQA Pro)
- Visual grounding (RefCOCO, RefCOCOg)
- Video and multi-image comprehension (BLINK, VideoMME)
Technical discussions on platforms like Reddit highlight significant improvements in OCR and document processing, particularly in extracting text from cluttered images and understanding complex visual queries.
High-Efficiency Training and Scalable Deployment
Ovis 2.5 enhances training efficiency through multimodal data packing and advanced hybrid parallelism, achieving a 3–4× speedup in overall throughput. Its lightweight 2B variant aligns with the “small model, big performance” philosophy, enabling high-quality multimodal understanding on mobile hardware and edge devices.
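As a rough illustration of why data packing raises throughput, the sketch below greedily packs variable-length samples into fixed-length training sequences so that far fewer tokens are wasted on padding. The sequence length and sample lengths are made-up numbers, and this is not Alibaba’s training code.

```python
# Illustrative sketch of multimodal data packing: variable-length samples are
# concatenated into fixed-length sequences to minimize padding waste.
from typing import List

MAX_LEN = 4096  # assumed packed sequence length

def pack_samples(sample_lengths: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Greedy first-fit-decreasing packing: each bin holds lengths summing to <= max_len."""
    bins: List[List[int]] = []
    for length in sorted(sample_lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

# A mix of short text-only samples and long image+text samples (token counts).
lengths = [512, 3800, 1024, 2048, 256, 900, 3100, 700]
packed = pack_samples(lengths)
waste = sum(MAX_LEN - sum(b) for b in packed) / (len(packed) * MAX_LEN)
print(f"{len(lengths)} samples packed into {len(packed)} sequences ({waste:.0%} padding)")
```

Packing eight samples into four sequences here leaves roughly a quarter of the tokens as padding, versus well over half if each sample occupied its own fixed-length sequence.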
Conclusion
Alibaba’s Ovis 2.5 models represent a significant advancement in open-source multimodal AI, achieving state-of-the-art scores on the OpenCompass leaderboard for models under 40B parameters. Key innovations include:
- A native-resolution vision transformer for processing high-detail visuals
- An optional “thinking mode” for enhanced self-reflective reasoning
Ovis 2.5 outperforms previous models in STEM reasoning, OCR, chart analysis, and video understanding, making advanced multimodal capabilities accessible to both researchers and resource-constrained applications.
Explore the Technical Paper and Models on Hugging Face. Visit our GitHub Page for Tutorials, Codes, and Notebooks.