Meta AI Introduces Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-modal Large Language Models
Multi-modal large language models (MLLMs) have shown significant progress as versatile AI assistants capable of managing various visual tasks. However, their impact is often limited when deployed as isolated digital entities. The integration of MLLMs into real-world applications such as robotics and autonomous vehicles necessitates advanced spatial understanding. Current MLLMs exhibit fundamental deficiencies in spatial reasoning, frequently struggling with basic tasks like distinguishing left from right.
Previous research has attributed these limitations to insufficient specialized training data, often addressing them through the incorporation of spatial data during training. However, these approaches tend to focus on single-image scenarios, restricting the model’s perception to static field-of-view analysis without incorporating dynamic information.
Advancements in Spatial Understanding
Several lines of research have sought to overcome the spatial-understanding limitations of MLLMs. These models typically rely on an image encoder that converts visual inputs into tokens, which are then processed alongside text tokens in the language model's latent space. While earlier research concentrated on single-image spatial understanding and the evaluation of inter-object spatial relations, benchmarks such as BLINK, UniQA-3D, and VSIBench have begun to extend beyond single images.
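To make this token-fusion idea concrete, here is a minimal sketch of a multi-frame MLLM forward pass. It assumes a generic ViT-style encoder and a linear projector; the class and parameter names are illustrative and do not come from the Multi-SpatialMLLM codebase.

```python
import torch
import torch.nn as nn

class ToyMultiFrameMLLM(nn.Module):
    """Illustrative only: generic MLLM-style fusion of visual and text tokens.

    Each frame is encoded into patch tokens, projected into the language
    model's embedding space, and concatenated with the text embeddings.
    """

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # assumed to return (B*T, N, vision_dim)
        self.projector = nn.Linear(vision_dim, text_dim)  # maps visual tokens into the LM space
        self.language_model = language_model              # decoder-only LM operating on embeddings

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) -- multiple frames per sample
        b, t = frames.shape[:2]
        patches = self.vision_encoder(frames.flatten(0, 1))       # (B*T, N, vision_dim)
        visual_tokens = self.projector(patches).view(b, -1, text_embeds.shape[-1])
        # Prepend all frames' visual tokens to the text tokens; the LM attends over both.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(fused)
```

The key design point is that multi-frame inputs simply contribute more visual tokens to the same sequence, so the language model can reason over cross-frame relations with its standard attention mechanism.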
Recent improvements in MLLMs for spatial understanding include:
- SpatialVLM: Fine-tunes models on curated spatial datasets.
- SpatialRGPT: Incorporates mask-based references and depth images.
- SpatialPIN: Utilizes specialized perception models without fine-tuning.
Introducing MultiSPA and Multi-SpatialMLLM
Researchers from FAIR Meta and the Chinese University of Hong Kong have proposed a framework to enhance MLLMs with multi-frame spatial understanding. This framework integrates three key components: depth perception, visual correspondence, and dynamic perception, effectively addressing the limitations of static single-image analysis.
They developed MultiSPA, a large-scale dataset comprising over 27 million samples across diverse 3D and 4D scenes. The Multi-SpatialMLLM model demonstrates significant improvements over baseline and proprietary systems, offering scalable and generalizable multi-frame reasoning capabilities.
To generate training data, five tasks were introduced:
- Depth perception
- Visual correspondence
- Camera movement perception
- Object movement perception
- Object size perception
The MultiSPA data generation pipeline follows standard MLLM fine-tuning practice, formatting each sample as a QA pair of the form "User: <question>" / "Assistant: <answer>" (a hypothetical generation sketch follows the list below). Ground-truth annotations are drawn from the following sources:
- 4D datasets: Aria Digital Twin and Panoptic Studio
- 3D tracking annotations: TAPVid3D for object movement perception
- ScanNet for other spatial tasks
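The sketch below shows how such a QA pair might be assembled for one of the five tasks (camera movement perception). The template wording, option set, and metadata field names are assumptions for illustration, not the actual MultiSPA templates.

```python
import json
import random

# Illustrative templates; the phrasing and fields are assumptions, not MultiSPA's.
QUESTION_TEMPLATE = (
    "User: Given frame 1 and frame 2 of the same scene, how did the camera move "
    "between the two frames? Choose one: {options}."
)
ANSWER_TEMPLATE = "Assistant: The camera moved {direction}."

def make_camera_movement_qa(sample: dict) -> dict:
    """Turn one annotated frame pair into a QA training sample.

    `sample` is assumed to carry two image paths and a ground-truth
    camera-motion label derived from the scene's camera poses.
    """
    options = ["left", "right", "forward", "backward"]
    random.shuffle(options)
    return {
        "images": [sample["frame1_path"], sample["frame2_path"]],
        "question": QUESTION_TEMPLATE.format(options=", ".join(options)),
        "answer": ANSWER_TEMPLATE.format(direction=sample["camera_direction"]),
    }

if __name__ == "__main__":
    demo = {
        "frame1_path": "scene0000_00/frame_000.jpg",
        "frame2_path": "scene0000_00/frame_030.jpg",
        "camera_direction": "left",
    }
    print(json.dumps(make_camera_movement_qa(demo), indent=2))
```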
MultiSPA generates over 27 million QA samples from 1.1 million unique images. For evaluation, 300 samples are held out per subtask, yielding a 7,800-sample benchmark (26 subtasks × 300 samples).
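Benchmark results of this kind are typically reported as per-subtask accuracy over the held-out samples. The sketch below shows one way such aggregation could be computed; the record format is an assumption, not the paper's evaluation code.

```python
from collections import defaultdict

def benchmark_accuracy(results: list[dict]) -> dict[str, float]:
    """Aggregate per-subtask accuracy over held-out benchmark samples.

    `results` is assumed to be a list of records like
    {"subtask": "depth_perception", "correct": True}, one per evaluated QA pair
    (300 per subtask in the MultiSPA benchmark).
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["subtask"]] += 1
        hits[r["subtask"]] += int(r["correct"])
    per_subtask = {name: hits[name] / totals[name] for name in totals}
    per_subtask["average"] = sum(per_subtask.values()) / len(per_subtask)
    return per_subtask
```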
Performance Metrics
On the MultiSPA benchmark, Multi-SpatialMLLM achieved an average 36% gain over base models, reaching 80-90% accuracy on qualitative tasks compared to 50% for baseline models. It outperformed all proprietary systems, even achieving 18% accuracy on challenging tasks like predicting camera movement vectors, where other baselines showed near-zero performance.
On the BLINK benchmark, Multi-SpatialMLLM reached nearly 90% accuracy, with an average 26.4% improvement over base models, demonstrating transferable multi-frame spatial understanding. Standard VQA benchmark evaluations indicated rough parity with original performance, suggesting the model maintains general-purpose MLLM proficiency without overfitting to spatial reasoning tasks.
Conclusion
This research extends the spatial understanding of MLLMs to multi-frame scenarios, addressing a critical gap in previous investigations. By introducing MultiSPA, the first large-scale dataset and benchmark for multi-frame spatial reasoning tasks, the researchers validate the effectiveness, scalability, and strong generalization capabilities of the Multi-SpatialMLLM across diverse spatial understanding challenges. The findings reveal significant insights, including the benefits of multi-task learning and emergent behaviors in complex spatial reasoning, paving the way for new applications such as acting as a multi-frame reward annotator.
For further details, check out the Paper, Project Page, and GitHub Page.