Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning
Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building on the joint-embedding predictive architecture (JEPA), V-JEPA 2 combines self-supervised learning from passive internet video with a small amount of robot interaction data to create a modular foundation for intelligent physical agents.
Scalable Self-Supervised Pretraining from 1M Hours of Video
V-JEPA 2 is pretrained on over 1 million hours of internet-scale video and 1 million images. Using a visual mask-denoising objective, the model predicts the representations of masked spatiotemporal patches in a learned latent space rather than reconstructing pixels. Predicting in latent space keeps capacity focused on predictable scene dynamics instead of low-level, unpredictable pixel detail, which improves efficiency.
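The objective can be illustrated with a short PyTorch-style sketch. This is a minimal, hypothetical rendering of JEPA-style masked latent prediction, not Meta's released training code; the module names (encoder, predictor, target_encoder), their signatures, and the EMA update are assumptions made for exposition.

```python
# Minimal sketch of JEPA-style masked latent prediction (illustrative, not the official code).
# A context encoder sees only the unmasked patches; a predictor must regress the
# target encoder's representations of the masked patches, entirely in latent space.
import torch
import torch.nn.functional as F

def jepa_step(encoder, predictor, target_encoder, patches, mask):
    """patches: (B, N, D_in) spatiotemporal patch tokens; mask: (B, N) bool, True = masked."""
    with torch.no_grad():
        # Targets come from an EMA "teacher" encoder over the full clip, no gradients.
        targets = target_encoder(patches)          # (B, N, D)

    # Context encoder processes only the visible tokens (assumed signature).
    context = encoder(patches, visible=~mask)      # (B, N, D)

    # Predictor fills in representations at the masked locations.
    preds = predictor(context, mask)               # (B, N, D)

    # Regression loss in representation space, computed on masked tokens only.
    return F.l1_loss(preds[mask], targets[mask])

@torch.no_grad()
def ema_update(target_encoder, encoder, momentum=0.999):
    # Teacher weights track the online encoder via an exponential moving average.
    for p_t, p_o in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)
```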
Key Techniques for Scaling JEPA Pretraining
To reach this scale, Meta researchers combined four key techniques (summarized in the configuration sketch after this list):
- Data scaling: Constructed a 22M-sample dataset (VideoMix22M) from public sources.
- Model scaling: Expanded the encoder’s capacity to over 1B parameters using ViT-g.
- Training schedule: Adopted a progressive resolution strategy and extended pretraining to 252K iterations.
- Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips.
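As a rough mental model, the recipe can be written down as a single configuration. The field names below are illustrative and not drawn from the authors' actual training config; only the values quoted above (22M samples, ViT-g at ~1B parameters, 252K iterations) come from the article.

```python
# Hypothetical summary of the pretraining recipe as a config dict (names are illustrative).
pretraining_recipe = {
    "data": {"dataset": "VideoMix22M", "num_samples": 22_000_000},  # data scaling
    "model": {"encoder": "ViT-g", "params": "~1B"},                 # model scaling
    "schedule": {"iterations": 252_000},                            # extended training
    "clips": {                                                      # progressive spatial-temporal augmentation
        "early": "shorter, lower-resolution clips",
        "late": "longer, higher-resolution clips",
    },
}
```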
Performance Metrics
These design choices led to an average accuracy of 88.2% across six benchmark tasks, surpassing previous baselines.
Understanding via Masked Representation Learning
V-JEPA 2 demonstrates robust motion understanding capabilities, achieving 77.3% top-1 accuracy on the Something-Something v2 benchmark, outperforming models such as InternVideo and VideoMAEv2. For appearance understanding, the model remains competitive with state-of-the-art image-text pretraining models.
Temporal Reasoning via Video Question Answering
To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on several video question-answering benchmarks (a sketch of this alignment setup follows the results below). Reported accuracies include:
- 84.0% on PerceptionTest
- 76.9% on TempCompass
- 44.5% on MVP
- 36.7% on TemporalBench
- 40.3% on TOMATO
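A minimal sketch of what such post-hoc alignment typically looks like: the frozen V-JEPA 2 encoder produces video tokens, a small trainable projector maps them into the language model's embedding space, and the combined sequence is fed to the LLM. The class, argument names, and dimensions below are assumptions for illustration, not Meta's implementation.

```python
import torch
import torch.nn as nn

class VideoLLMAlignment(nn.Module):
    """Hypothetical post-hoc alignment: frozen video encoder + trainable projector + LLM."""
    def __init__(self, video_encoder, llm, vid_dim=1408, llm_dim=4096):
        super().__init__()
        self.video_encoder = video_encoder.eval()   # frozen pretrained video encoder
        for p in self.video_encoder.parameters():
            p.requires_grad_(False)
        # Small trainable projector maps video tokens into the LLM token space.
        self.projector = nn.Sequential(
            nn.Linear(vid_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm                              # decoder-only language model

    def forward(self, video, text_embeds):
        with torch.no_grad():
            vid_tokens = self.video_encoder(video)  # (B, N, vid_dim), no gradients
        vid_embeds = self.projector(vid_tokens)     # (B, N, llm_dim)
        # Prepend projected video tokens to the text prompt embeddings.
        inputs = torch.cat([vid_embeds, text_embeds], dim=1)
        # Assumes an LLM that accepts precomputed input embeddings.
        return self.llm(inputs_embeds=inputs)
```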
These results indicate that a video encoder pretrained without language supervision can be aligned with a language model after the fact and still generalize well across temporal-reasoning benchmarks.
Introducing V-JEPA 2-AC for Robotic Planning
A key innovation is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned on only 62 hours of unlabeled robot video, this model predicts future embeddings conditioned on robot actions and achieves high success rates on tasks such as reaching, grasping, and pick-and-place without any reward supervision.
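Conceptually, the action-conditioned model rolls future latent states forward from the current observation and a candidate action sequence. The sketch below illustrates that interface; the class, method, and argument names are assumptions, not the released V-JEPA 2-AC code.

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Hypothetical rollout head: predicts future frame embeddings given robot actions."""
    def __init__(self, frozen_encoder, predictor):
        super().__init__()
        self.encoder = frozen_encoder    # pretrained V-JEPA 2 encoder, kept frozen
        self.predictor = predictor       # transformer conditioned on (latent state, action)

    @torch.no_grad()
    def encode(self, frames):
        return self.encoder(frames)      # (B, N, D) latent state of the current observation

    def rollout(self, z_t, actions):
        """Autoregressively predict latents for each action in a candidate sequence."""
        states = [z_t]
        for a in actions.unbind(dim=1):  # actions: (B, T, action_dim)
            z_next = self.predictor(states[-1], a)
            states.append(z_next)
        return torch.stack(states[1:], dim=1)   # predicted future latents (B, T, N, D)
```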
Benchmark Performance
V-JEPA 2-AC outperforms baselines such as Octo and Cosmos, planning each action in roughly 16 seconds and achieving a 100% success rate on reach tasks. It operates from a single monocular RGB camera, demonstrating the generalization capability of the learned world model.
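Planning with such a model is typically framed as model-predictive control: sample candidate action sequences, roll them out in latent space with the world model, and execute the first action of the sequence whose predicted final state lands closest to the embedding of a goal image. The cross-entropy-method loop below is a generic sketch of that idea, reusing the hypothetical rollout interface from the previous snippet; all shapes and hyperparameters are assumptions.

```python
import torch

def plan_with_cem(model, z_current, z_goal, horizon=2, action_dim=7,
                  n_samples=256, n_elites=32, n_iters=5):
    """Generic CEM planner over action sequences in latent space (illustrative only).

    z_current, z_goal: (1, N, D) latents of the current observation and the goal image.
    """
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        # Roll each candidate forward in latent space with the world model.
        z_pred = model.rollout(z_current.expand(n_samples, *z_current.shape[1:]), actions)
        # Energy: distance between the final predicted latent and the goal-image latent.
        energy = (z_pred[:, -1] - z_goal).flatten(1).norm(dim=1)
        # Refit the Gaussian to the lowest-energy (elite) sequences.
        elite = actions[energy.topk(n_elites, largest=False).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0)
    return mean[0]   # execute the first action, then replan (receding horizon)
```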
Conclusion
Meta’s V-JEPA 2 marks a significant development in scalable self-supervised learning for physical intelligence, demonstrating that general-purpose visual representations can be harnessed for both perception and control in real-world applications.
Further Resources
Check out the Paper, Models on Hugging Face, and the GitHub Page.