Meta CLIP 2: The First Contrastive Language-Image Pre-training (CLIP) Trained with Worldwide Image-Text Pairs from Scratch
Contrastive Language-Image Pre-training (CLIP) has become crucial for modern vision and multimodal models, enabling applications such as zero-shot image classification and serving as vision encoders in Multimodal Large Language Models (MLLMs). However, most CLIP variants, including Meta CLIP, are limited to English-only data curation, ignoring a significant amount of non-English content from the worldwide web.
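The zero-shot mechanism that makes CLIP useful here is simple: class names become text prompts, the image and the prompts are embedded into a shared space, and the class whose prompt is most similar to the image embedding wins. Below is a minimal sketch of that flow using the public OpenAI CLIP checkpoint on Hugging Face as a stand-in; the checkpoint ID, image path, and class prompts are illustrative, not the paper's setup.

```python
# Illustrative zero-shot classification with a public CLIP checkpoint.
# The checkpoint ID, image path, and class prompts are examples only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"   # stand-in checkpoint, not Meta CLIP 2
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

classes = ["dog", "cat", "bicycle"]
prompts = [f"a photo of a {c}" for c in classes]
image = Image.open("photo.jpg")              # any local image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)                # embeds both modalities in one forward pass

# logits_per_image holds image-to-text similarity scores; softmax turns them into class probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```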
Scaling CLIP to include multilingual data presents two main challenges: the lack of an efficient method to curate non-English data at scale, and the decline of English performance when adding multilingual data, known as the curse of multilinguality. These issues hinder the development of unified models optimized for both English and non-English tasks.
Methods like OpenAI CLIP and Meta CLIP rely on English-centric curation, while distillation-based approaches inherit biases from external teacher models. Multilingual CLIP models such as M-CLIP and mCLIP adopt distillation but train their multilingual text encoders on low-quality data. Hybrid methods like SLIP and LiT combine language supervision with self-supervised learning (SSL) to balance semantic alignment and visual representation quality. Despite these efforts, none of these approaches resolves both core issues: scalable non-English curation and the curse of multilinguality.
Researchers from Meta, MIT, Princeton University, and New York University have proposed Meta CLIP 2, the first method to train CLIP models from scratch using native worldwide image-text pairs without relying on external resources like private data, machine translation, or distillation. This approach removes the performance trade-offs between English and non-English data by jointly scaling metadata, data curation, model capacity, and training.
Meta CLIP 2 maximizes compatibility with OpenAI CLIP’s architecture, ensuring generalizability to CLIP and its variants. Its innovations for scaling to worldwide data include:
- Scalable metadata across 300+ languages
- A per-language curation algorithm for balanced concept distribution (sketched after this list)
- An advanced training framework
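The per-language curation item extends Meta CLIP's substring matching and concept balancing to every language: alt-texts are matched against metadata entries written in their own language, and frequent (head) entries are capped so that rare (tail) concepts are not drowned out. The sketch below illustrates only that balancing idea; the data layout, language codes, and per-language cap `t_lang` are placeholders, not the released algorithm.

```python
# Simplified sketch of per-language, metadata-balanced curation.
# `pairs`, `metadata`, and the per-language cap `t_lang` are illustrative placeholders.
import random
from collections import defaultdict

def curate(pairs, metadata, t_lang):
    """pairs: list of (image_url, alt_text, lang);
    metadata: lang -> set of concept strings for that language;
    t_lang: lang -> cap on how many pairs a single concept may contribute."""
    buckets = defaultdict(list)                  # (lang, concept) -> matched pairs
    for url, text, lang in pairs:
        for concept in metadata.get(lang, ()):   # match alt-text against same-language metadata
            if concept in text:
                buckets[(lang, concept)].append((url, text, lang))

    curated = []
    for (lang, concept), matched in buckets.items():
        cap = t_lang[lang]
        if len(matched) > cap:                   # cap head concepts; tail concepts survive untouched
            matched = random.sample(matched, cap)
        curated.extend(matched)                  # note: pairs matching several concepts may repeat
    return curated
```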
To address the first challenge, the researchers curated data globally rather than in English alone; to tackle the second, they developed a worldwide CLIP training framework that follows OpenAI CLIP’s and Meta CLIP’s training settings and model architecture. This framework includes:
- A multilingual text tokenizer (illustrated in the sketch after this list)
- Scaling of seen training pairs
- An analysis of minimal viable model capacity
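The tokenizer item is what keeps non-English alt-texts from being shredded into long runs of byte-level pieces by an English-centric BPE vocabulary. The comparison below uses an off-the-shelf XLM-R tokenizer from Hugging Face transformers purely as a stand-in for whichever multilingual tokenizer the paper adopts.

```python
# Compare an English-centric CLIP tokenizer with a multilingual one on non-English text.
# XLM-R here is a stand-in, not necessarily the tokenizer chosen in the paper.
from transformers import AutoTokenizer, CLIPTokenizer

english_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
multi_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

texts = ["a photo of a dog", "一只狗的照片", "صورة كلب"]
for t in texts:
    # The English byte-level BPE fragments non-Latin scripts into many pieces,
    # while the multilingual vocabulary keeps the same text far more compact.
    print(f"{len(english_tok.tokenize(t)):>3} vs {len(multi_tok.tokenize(t)):>3} tokens | {t}")
```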
The training setup uses OpenAI CLIP’s ViT-L/14 and Meta CLIP’s ViT-H/14 models, with modifications for multilingual support. Studies on minimal model expressivity reveal that even OpenAI’s ViT-L/14 struggles with the curse due to limited capacity, whereas ViT-H/14 serves as an inflection point, achieving notable gains in both English and non-English tasks.
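The capacity argument can be made concrete with a back-of-the-envelope parameter count for the two vision towers. The sketch below uses the standard ViT-L/14 and ViT-H/14 shapes and the common 12 * width^2 * depth rule of thumb for transformer block parameters; released checkpoints differ slightly because embeddings, norms, and projection heads are ignored here.

```python
# Rough parameter-count comparison of the two vision towers discussed above.
# 12 * width^2 * depth approximates attention + MLP weights per transformer block
# and deliberately ignores patch/position embeddings, layer norms, and biases.
def approx_vit_params_millions(depth: int, width: int) -> float:
    return 12 * width ** 2 * depth / 1e6

configs = {
    "ViT-L/14": dict(depth=24, width=1024),
    "ViT-H/14": dict(depth=32, width=1280),
}
for name, cfg in configs.items():
    print(f"{name}: ~{approx_vit_params_millions(**cfg):.0f}M parameters")
# -> ViT-L/14 ~302M, ViT-H/14 ~629M, roughly double the capacity at the reported inflection point
```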
Meta CLIP 2 outperforms both its English-only (1.0× seen pairs) and its non-English (1.3× seen pairs) counterparts on English and multilingual tasks when trained on ViT-H/14 with worldwide data and scaled seen pairs. The curse persists, however, in non-scaled settings or with smaller models such as ViT-L/14. Transitioning from English-centric metadata to its worldwide equivalent must also be done carefully: simply removing the English filter on alt-texts leads to a 0.6% drop in ImageNet accuracy, highlighting the role of language isolation in the original pipeline.
Replacing English metadata with merged worldwide metadata initially lowers English performance but boosts multilingual capabilities. Evaluations on zero-shot classification and few-shot geo-localization benchmarks show that scaling from 13B English to 29B worldwide seen pairs improves results across the board, except on GeoDE, where performance is already saturated.
In conclusion, researchers introduced Meta CLIP 2, the first CLIP model trained from scratch on worldwide image-text pairs. It demonstrates that scaling metadata, curation, and training capacity can break the “curse of multilinguality,” enabling mutual benefits for English and non-English performance. Meta CLIP 2 (ViT-H/14) outperforms its English-only counterpart on zero-shot ImageNet (80.5% → 81.3%) and excels on multilingual benchmarks such as XM3600, Babel-IN, and CVQA with a single unified model. By open-sourcing its metadata, curation methods, and training code, Meta CLIP 2 empowers the research community to move beyond English-centric approaches and embrace the potential of the worldwide multimodal web.