Researchers Introduce MMLONGBENCH: A Comprehensive Benchmark for Long-Context Vision-Language Models
Recent advances in long-context (LC) modeling have significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). Long-context vision-language models (LCVLMs) can now process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, evaluation benchmarks have not kept pace, leaving it unclear how well these models perform in long-context settings, which tasks they struggle with, and how robust they are to variations in input length.
Challenges with Current Benchmarks
Current evaluation benchmarks face several limitations:
- Limited coverage of downstream tasks
- Insufficient coverage of image types
- Lack of context length control
- Evaluation at only a single context length
Various techniques have been developed to extend context windows for LVLMs, including longer pre-training lengths, position extrapolation, and efficient architectures. Notable models such as Gemini-2.5 and Qwen2.5-VL have incorporated these methods alongside vision token compression techniques to accommodate longer sequences.
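Position extrapolation can take several forms; as one illustrative approach (not necessarily the method used by these specific models), RoPE-style position interpolation rescales positions beyond the training window back into the trained range. A minimal NumPy sketch, with the head dimension and frequency base chosen as typical but assumed values:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary position angles; scale > 1 compresses positions so that
    sequences longer than the training window reuse trained frequencies."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    scaled_pos = np.asarray(positions, dtype=np.float64) / scale  # position interpolation
    return np.outer(scaled_pos, inv_freq)  # shape: (seq_len, dim // 2)

# Example: stretch a 32K-trained window to 128K by interpolating 4x.
angles = rope_angles(positions=range(131072), dim=128, scale=131072 / 32768)
```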
Introducing MMLONGBENCH
Researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Technology Center have proposed MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs. This benchmark comprises 13,331 examples across five downstream task categories, including Visual RAG and Many-Shot In-Context Learning (ICL), while covering both natural and synthetic image types. All examples are standardized across five input lengths from 8K to 128K tokens using a cross-modal tokenization scheme that combines vision patches and text tokens.
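The paper's exact token-counting rule is not reproduced here, but the idea of a cross-modal budget can be sketched as text tokens plus the vision patches each image contributes. The patch size and merge factor below are illustrative assumptions, not the benchmark's actual settings:

```python
def count_cross_modal_tokens(text_token_count, image_sizes, patch_size=14, merge=2):
    """Hypothetical cross-modal token budget: text tokens plus the number of
    vision patches each image contributes after patchification and merging."""
    vision_tokens = 0
    for width, height in image_sizes:
        patches = (width // patch_size) * (height // patch_size)
        vision_tokens += patches // (merge * merge)  # e.g. 2x2 patch merging
    return text_token_count + vision_tokens

# Example: 2,000 text tokens plus three 448x448 images.
total = count_cross_modal_tokens(2000, [(448, 448)] * 3)
```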
The benchmarking process involved evaluating 46 closed-source and open-source models, revealing that performance on a single task is a poor predictor of overall long-context capability. Both closed- and open-source models struggled with long-context tasks, although models with stronger reasoning ability performed better.
Methodology and Evaluation
To construct long-context scenarios, the researchers inserted gold passages containing the answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages were taken from KILT, while InfoSeek used the lead sections of Wikipedia entity pages. Wikipedia pages were split into 100-word passages, and distractors were added until the target input length was reached; the padding scheme is sketched below. Many-shot in-context learning tasks drew on four diverse image classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, fitting up to 500 images within a 128K context window.
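A minimal sketch of that padding scheme, assuming a generic `count_tokens` callable (the benchmark's own cross-modal counter would take its place) and random placement of the gold passages among the distractors:

```python
import random

def build_long_context(gold_passages, distractor_pool, target_tokens, count_tokens):
    """Start from the gold passages and add shuffled distractor passages
    until the token budget (e.g. 8K-128K) is reached, then shuffle so the
    gold passages are hidden among the distractors."""
    context = list(gold_passages)
    used = sum(count_tokens(p) for p in context)
    pool = list(distractor_pool)
    random.shuffle(pool)
    for passage in pool:
        cost = count_tokens(passage)
        if used + cost > target_tokens:
            break
        context.append(passage)
        used += cost
    random.shuffle(context)
    return context
```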
Across tasks and context lengths, all models struggled on MMLONGBENCH, with closed-source models generally performing better. At the maximum input length of 128K, GPT-4o achieved only a 62.9 average score. Gemini-2.5-Pro emerged as the top performer, surpassing open-source models by 20 points except on ICL tasks. Ovis2-34B scored 41.6 on summarization, closely trailing GPT-4o's 42.4, and Qwen2.5-VL-32B achieved a SubEM score of 64.6 on VRAG, outperforming Gemini-2.0-Flash.
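SubEM (substring exact match) is a standard QA-style metric: a prediction is credited if a normalized gold answer appears as a substring of the normalized output. A minimal sketch using common normalization conventions (the paper's exact normalization may differ):

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def sub_em(prediction, gold_answers):
    """Return 1.0 if any normalized gold answer is a substring of the
    normalized prediction, else 0.0."""
    pred = normalize(prediction)
    return float(any(normalize(ans) in pred for ans in gold_answers))

# Example: sub_em("The landmark is the Eiffel Tower.", ["Eiffel Tower"]) -> 1.0
```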
Interestingly, models demonstrated generalization capabilities beyond their training context lengths, with Qwen2-VL-72B achieving a 51.9 average score at 128K despite a 32K training window.
Conclusion
The introduction of MMLONGBENCH marks a significant advancement in the evaluation of LCVLMs across diverse downstream tasks. This benchmark provides a rigorous foundation for diagnosing model capabilities, covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models indicates that single-task performance does not reliably predict overall long-context ability and highlights significant challenges in OCR accuracy and cross-modal retrieval faced by frontier models. MMLONGBENCH stands as a standard evaluation framework to guide future research toward more efficient vision-language token encodings, robust position-extrapolation schemes, and enhanced multi-modal retrieval and reasoning capabilities.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.