Meta AI Researchers Release MapAnything: An End-to-End Transformer Architecture that Directly Regresses Factored, Metric 3D Scene Geometry
A team of researchers from Meta Reality Labs and Carnegie Mellon University has introduced MapAnything, an end-to-end transformer architecture that directly regresses factored metric 3D scene geometry from images and optional sensor inputs. Released under Apache 2.0 with full training and benchmarking code, MapAnything supports over 12 distinct 3D vision tasks in a single feed-forward pass, advancing beyond traditional specialist pipelines.
Understanding the Target Audience
The primary audience for this research includes:
- AI researchers and practitioners specializing in computer vision and 3D reconstruction.
- Data scientists and machine learning engineers seeking to implement advanced models in their projects.
- Business leaders in industries such as robotics, gaming, and augmented reality looking to leverage cutting-edge technology for competitive advantage.
Key pain points for this audience include:
- Complexity and inefficiency of existing modular solutions for 3D reconstruction.
- Challenges in integrating multiple data sources and optimizing performance across various tasks.
- The need for scalable and robust models that can adapt to diverse input configurations.
The audience's goals include:
- Improving the accuracy and efficiency of 3D modeling processes.
- Reducing the time and resources spent on tuning and optimizing models.
- Implementing state-of-the-art solutions that can handle a wide array of tasks seamlessly.
Interests include:
- Latest advancements in AI and machine learning technologies.
- Open-source tools and frameworks that facilitate rapid development and deployment.
- Real-world applications of 3D reconstruction in various sectors.
Communication preferences lean towards:
- Technical documentation and peer-reviewed publications for in-depth understanding.
- Webinars, tutorials, and community forums for practical insights and peer interaction.
Why a Universal Model for 3D Reconstruction?
Image-based 3D reconstruction has historically relied on fragmented pipelines involving feature detection, two-view pose estimation, bundle adjustment, multi-view stereo, and monocular depth inference. While effective, these modular solutions require task-specific tuning and optimization.
MapAnything addresses these challenges by:
- Accepting up to 2,000 input images in a single inference run.
- Utilizing auxiliary data such as camera intrinsics, poses, and depth maps.
- Producing direct metric 3D reconstructions without the need for bundle adjustment.
The model's factored scene representation, composed of ray maps, depth, poses, and a global scale factor, offers greater modularity and generality than prior feed-forward approaches.
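To make the input flexibility concrete, here is a minimal Python sketch of how a single feed-forward call could package multiple views together with optional geometric hints. The `ViewInput` container and `reconstruct` function are hypothetical placeholders for illustration, not the released MapAnything API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

import torch


@dataclass
class ViewInput:
    image: torch.Tensor                         # (3, H, W) RGB image
    intrinsics: Optional[torch.Tensor] = None   # (3, 3) camera matrix, if known
    pose: Optional[torch.Tensor] = None         # (4, 4) camera-to-world, if known
    depth: Optional[torch.Tensor] = None        # (H, W) metric depth, if known


def reconstruct(views: Sequence[ViewInput]) -> dict:
    """Stand-in for a single feed-forward pass: any subset of the optional
    fields may be present per view; the network is expected to use whatever
    is available and regress the rest (rays, depth, poses, metric scale)."""
    assert len(views) >= 1
    # Here we only report which hints are present; the real model would
    # return the factored scene representation described below.
    return {
        "num_views": len(views),
        "has_intrinsics": [v.intrinsics is not None for v in views],
        "has_pose": [v.pose is not None for v in views],
        "has_depth": [v.depth is not None for v in views],
    }


# Example: two calibrated views and one image-only view in the same call.
views = [
    ViewInput(image=torch.rand(3, 384, 512), intrinsics=torch.eye(3)),
    ViewInput(image=torch.rand(3, 384, 512), intrinsics=torch.eye(3), pose=torch.eye(4)),
    ViewInput(image=torch.rand(3, 384, 512)),
]
print(reconstruct(views))
```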
Architecture and Representation
MapAnything employs a multi-view alternating-attention transformer. Each input image is encoded with DINOv2 ViT-L features, while optional inputs (rays, depth, poses) are encoded into the same latent space via shallow CNNs or MLPs. A learnable scale token enables metric normalization across views.
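The following is a simplified PyTorch sketch of the alternating-attention pattern described above: blocks alternate between attention within each view's tokens and attention across all views' tokens. Dimensions, normalization details, and the MLP sublayers of the real model will differ; this only illustrates the attention pattern, not the released implementation.

```python
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """Toy block: within-view self-attention followed by cross-view attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.within = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.across = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (views, tokens_per_view, dim)
        v, t, d = tokens.shape
        # 1) within-view attention: each view attends only to its own tokens
        x = self.norm1(tokens)
        x = tokens + self.within(x, x, x, need_weights=False)[0]
        # 2) cross-view attention: flatten all views into one joint token set
        y = x.reshape(1, v * t, d)
        y_n = self.norm2(y)
        y = y + self.across(y_n, y_n, y_n, need_weights=False)[0]
        return y.reshape(v, t, d)


block = AlternatingAttentionBlock()
out = block(torch.rand(4, 196, 256))   # e.g. 4 views of 14x14 patch tokens
print(out.shape)                        # torch.Size([4, 196, 256])
```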
The network outputs a factored representation:
- Per-view ray directions (camera calibration).
- Depth along rays, predicted up-to-scale.
- Camera poses relative to a reference view.
- A single metric scale factor converting local reconstructions into a globally consistent frame.
This explicit factorization avoids redundancy, allowing the same model to handle monocular depth estimation, multi-view stereo, structure-from-motion (SfM), or depth completion without specialized heads.
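A short sketch of how these factored outputs could be recomposed into a metric point map for one view, assuming unit ray directions, depth along those rays, a camera-to-world pose, and a single global scale factor. The exact conventions (for example, where the scale is applied) may differ in the released code.

```python
import torch


def compose_metric_points(ray_dirs: torch.Tensor,   # (H, W, 3) unit ray directions
                          depth: torch.Tensor,      # (H, W) up-to-scale depth along rays
                          pose: torch.Tensor,       # (4, 4) camera-to-world of this view
                          scale: torch.Tensor       # scalar global metric scale factor
                          ) -> torch.Tensor:
    # Illustrative composition: world_point = scale * (R @ (depth * ray_dir) + t)
    cam_points = ray_dirs * depth.unsqueeze(-1)     # points in the camera frame
    R, t = pose[:3, :3], pose[:3, 3]
    world_points = cam_points @ R.T + t             # rotate + translate into the world frame
    return scale * world_points                     # apply the global metric scale


H, W = 4, 5
rays = torch.nn.functional.normalize(torch.rand(H, W, 3), dim=-1)
pts = compose_metric_points(rays, torch.rand(H, W), torch.eye(4), torch.tensor(1.7))
print(pts.shape)  # torch.Size([4, 5, 3])
```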
Training Strategy
MapAnything was trained across 13 diverse datasets spanning indoor, outdoor, and synthetic domains, including BlendedMVS, Mapillary Planet-Scale Depth, ScanNet++, and TartanAirV2. Two variants are released:
- Apache 2.0 licensed model trained on six datasets.
- CC BY-NC model trained on all thirteen datasets for stronger performance.
Key training strategies include:
- Probabilistic input dropout: During training, geometric inputs (rays, depth, pose) are provided with varying probabilities, enhancing robustness across heterogeneous configurations.
- Covisibility-based sampling: Ensures input views share meaningful overlap, supporting reconstruction from more than 100 views.
- Factored losses in log-space: Depth, scale, and pose are optimized with scale-invariant and robust regression losses to improve stability (a minimal sketch of the input dropout and the log-space loss follows this list).
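Below is a minimal, illustrative sketch of two of these ideas: probabilistic dropout of optional geometric inputs and a scale-invariant depth loss computed in log-space. The function names, dropout probabilities, and exact loss form are assumptions for illustration; the paper's full objective also covers scale and pose terms.

```python
import random
from typing import Optional

import torch


def drop_optional_inputs(view: dict, keep_prob: Optional[dict] = None) -> dict:
    """Randomly hide each optional geometric input so the model learns to work
    with any subset of {images, rays, depth, pose} at inference time."""
    if keep_prob is None:
        keep_prob = {"rays": 0.5, "depth": 0.5, "pose": 0.5}  # illustrative probabilities
    out = dict(view)
    for key, p in keep_prob.items():
        if out.get(key) is not None and random.random() > p:
            out[key] = None
    return out


def scale_invariant_log_depth_loss(pred: torch.Tensor, gt: torch.Tensor,
                                   eps: float = 1e-6) -> torch.Tensor:
    """Compare depths in log-space after removing a global log-scale offset,
    making the loss insensitive to an overall scale mismatch."""
    diff = torch.log(pred + eps) - torch.log(gt + eps)
    diff = diff - diff.median()     # remove the global (log) scale offset
    return diff.abs().mean()        # robust L1 regression in log-space


# The loss is (near) zero when prediction and ground truth differ only by a scale factor.
pred = torch.rand(2, 64, 64) + 0.1
print(scale_invariant_log_depth_loss(pred, 2.0 * pred))                   # ~0
print(scale_invariant_log_depth_loss(pred, torch.rand(2, 64, 64) + 0.1))  # > 0

# Example: the pose hint for this view survives or is dropped at random.
print(drop_optional_inputs({"image": "img_0", "rays": None, "depth": None, "pose": "T_0"}))
```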
Training was performed on 64 H200 GPUs with mixed precision, gradient checkpointing, and curriculum scheduling, scaling from 4 to 24 input views.
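As a rough illustration of the view curriculum, a schedule that ramps the number of input views from 4 to 24 over training could look like the sketch below; the actual schedule used by the authors is not detailed here, so this is purely an assumption.

```python
# Hypothetical linear curriculum for the number of input views per batch.
def views_for_step(step: int, total_steps: int,
                   min_views: int = 4, max_views: int = 24) -> int:
    frac = min(1.0, step / max(1, total_steps - 1))
    return round(min_views + frac * (max_views - min_views))


print([views_for_step(s, 100) for s in (0, 25, 50, 99)])   # [4, 9, 14, 24]
```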
Benchmarking Results
MapAnything achieves state-of-the-art (SoTA) performance across multiple benchmarks:
- Multi-View Dense Reconstruction: On ETH3D, ScanNet++ v2, and TartanAirV2-WB, it surpasses baselines like VGGT and Pow3R.
- Two-View Reconstruction: Consistently outperforms DUSt3R, MASt3R, and Pow3R across scale, depth, and pose accuracy.
- Single-View Calibration: Achieves an average angular error of 1.18°, outperforming AnyCalib (2.01°) and MoGe-2 (1.95°); a sketch of this metric appears at the end of this section.
- Depth Estimation: Sets new SoTA for multi-view metric depth estimation on the Robust-MVD benchmark.
Overall, the benchmarks report up to a 2× improvement over prior SoTA methods on many tasks, supporting the benefits of unified training.
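For reference, the calibration numbers above are average angular errors between predicted and ground-truth per-pixel ray directions. A generic version of that metric can be sketched as follows; the official evaluation protocol may differ in detail.

```python
import torch


def mean_angular_error_deg(pred_rays: torch.Tensor, gt_rays: torch.Tensor) -> torch.Tensor:
    """pred_rays, gt_rays: (..., 3) ray directions; returns mean error in degrees."""
    pred = torch.nn.functional.normalize(pred_rays, dim=-1)
    gt = torch.nn.functional.normalize(gt_rays, dim=-1)
    cos = (pred * gt).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()


gt = torch.nn.functional.normalize(torch.rand(480, 640, 3), dim=-1)
noisy = torch.nn.functional.normalize(gt + 0.02 * torch.randn_like(gt), dim=-1)
print(mean_angular_error_deg(noisy, gt))   # small error, reported in degrees
```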
Key Contributions
The research team highlights four major contributions:
- Unified Feed-Forward Model capable of handling more than 12 problem settings, from monocular depth to SfM and stereo.
- Factored Scene Representation enabling explicit separation of rays, depth, pose, and metric scale.
- State-of-the-Art Performance across diverse benchmarks with fewer redundancies and higher scalability.
- Open-Source Release including data processing, training scripts, benchmarks, and pretrained weights under Apache 2.0.
Conclusion
MapAnything establishes a new benchmark in 3D vision by unifying multiple reconstruction tasks—SfM, stereo, depth estimation, and calibration—under a single transformer model with a factored scene representation. It not only outperforms specialist methods across benchmarks but also adapts seamlessly to heterogeneous inputs, including intrinsics, poses, and depth. With open-source code, pretrained models, and support for over 12 tasks, MapAnything lays the groundwork for a truly general-purpose 3D reconstruction backbone.
Check out the Paper, Code, and Project Page.