IBM AI Research Releases Two English Granite Embedding Models, Both Based on the ModernBERT Architecture
IBM has strengthened its presence in the open-source AI ecosystem with the release of two new embedding models: granite-embedding-english-r2 and granite-embedding-small-english-r2. Both are designed for high-performance retrieval and retrieval-augmented generation (RAG) systems, and both are compact, efficient, and licensed under Apache 2.0, making them suitable for commercial deployment.
What Models Did IBM Release?
The two models cater to different compute budgets:
- granite-embedding-english-r2: 149 million parameters, embedding size of 768, built on a 22-layer ModernBERT encoder.
- granite-embedding-small-english-r2: 47 million parameters, embedding size of 384, utilizing a 12-layer ModernBERT encoder.
Both models support a maximum context length of 8192 tokens, a significant improvement over the first-generation Granite embeddings, making them suitable for enterprise workloads involving long documents and complex retrieval tasks.
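In practice, generating embeddings takes only a few lines with the sentence-transformers library. The sketch below is a minimal usage example, assuming the models are published under the ibm-granite organization on Hugging Face and that a recent transformers release with ModernBERT support is installed:

```python
# Minimal retrieval sketch; the repo ID assumes the models live under the
# ibm-granite organization on Hugging Face.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-small-english-r2")

docs = [
    "Granite R2 supports context lengths of up to 8192 tokens.",
    "FlashAttention 2 reduces memory usage during inference.",
]
query = "What is the maximum context length of Granite R2?"

# Normalized embeddings make cosine similarity a simple dot product.
doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_emb)
print(scores)  # higher score = better match
```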
What’s Inside the Architecture?
Both models leverage the ModernBERT backbone, which includes several optimizations:
- Alternating global and local attention to balance efficiency with long-range dependencies (see the toy sketch after this list).
- Rotary positional embeddings (RoPE) tuned for positional interpolation, enabling longer context windows.
- FlashAttention 2 to reduce memory usage and improve throughput during inference.
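To make the alternating attention pattern concrete, here is a toy sketch of how a per-layer mask can switch between global and sliding-window attention. The 1:3 global-to-local ratio and 128-token window are illustrative defaults in the spirit of ModernBERT, not confirmed Granite R2 hyperparameters:

```python
import torch

def layer_attention_mask(seq_len: int, layer_idx: int,
                         global_every: int = 3, window: int = 128) -> torch.Tensor:
    """Toy mask: every `global_every`-th layer attends globally; the rest
    use a local sliding window. Ratio and window size are illustrative."""
    if layer_idx % global_every == 0:
        # Global layer: every token attends to every other token.
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Local layer: token i only sees tokens within window // 2 positions.
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).abs() <= window // 2

mask = layer_attention_mask(seq_len=512, layer_idx=1)
print(mask.float().mean())  # fraction of attention pairs a local layer allows
```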
IBM trained these models using a multi-stage pipeline, beginning with masked language pretraining on a two-trillion-token dataset sourced from web content, Wikipedia, PubMed, BookCorpus, and internal IBM technical documents. This was followed by context extension from 1k to 8k tokens, contrastive learning with distillation from Mistral-7B, and domain-specific tuning for conversational, tabular, and code retrieval tasks.
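The contrastive stage can be pictured as an InfoNCE objective over in-batch negatives, blended with a KL term that pulls the student's score distribution toward the teacher's (here, Mistral-7B). The formulation below is a hypothetical sketch; the temperature, loss weighting, and exact distillation target are assumptions, not IBM's published recipe:

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(q_emb, d_emb, teacher_scores,
                             temperature=0.05, alpha=0.5):
    """Hypothetical objective: InfoNCE over in-batch negatives plus
    KL distillation toward a teacher's similarity scores."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    scores = q @ d.T / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=scores.device)  # diagonal = positives
    nce = F.cross_entropy(scores, labels)
    kl = F.kl_div(F.log_softmax(scores, dim=-1),
                  F.softmax(teacher_scores, dim=-1),
                  reduction="batchmean")
    return alpha * nce + (1 - alpha) * kl
```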
How Do They Perform on Benchmarks?
The Granite R2 models exhibit strong performance across widely used retrieval benchmarks. On MTEB-v2 and BEIR, the larger granite-embedding-english-r2 outperforms similarly sized models such as BGE Base, E5, and Arctic Embed. The smaller model, granite-embedding-small-english-r2, achieves accuracy comparable to models two to three times larger, making it particularly appealing for latency-sensitive workloads.
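Numbers like these can be spot-checked locally with the open-source mteb package. The snippet below runs a single small BEIR retrieval task as a sanity check; the task choice is illustrative, and the API shown follows recent mteb releases:

```python
# Sanity-check evaluation on one small retrieval task.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-small-english-r2")
tasks = mteb.get_tasks(tasks=["NFCorpus"])  # a small BEIR retrieval dataset
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")
```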
Both models excel in specialized domains:
- Long-document retrieval (MLDR, LongEmbed), where 8k context support is critical.
- Table retrieval tasks (OTT-QA, FinQA, OpenWikiTables) requiring structured reasoning.
- Code retrieval (CoIR), effectively handling both text-to-code and code-to-text queries.
Are They Fast Enough for Large-Scale Use?
Efficiency is a standout feature of these models. On an NVIDIA H100 GPU, granite-embedding-small-english-r2 encodes nearly 200 documents per second, significantly faster than BGE Small and E5 Small. The larger granite-embedding-english-r2 reaches 144 documents per second, outperforming many ModernBERT-based alternatives. Notably, both models remain practical on CPUs, allowing enterprises to deploy them in less GPU-intensive environments. This combination of speed, compact size, and retrieval accuracy makes them highly adaptable for real-world applications.
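Throughput figures of this kind are easy to approximate on your own hardware. The sketch below times batched encoding and reports documents per second; batch size and document length are arbitrary choices, and results will vary with hardware, sequence length, and precision:

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-small-english-r2")

# Synthetic corpus; real throughput depends heavily on document length.
docs = ["A synthetic document used for benchmarking. " * 50] * 1024

start = time.perf_counter()
model.encode(docs, batch_size=64, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:.1f} docs/sec")
```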
What Does This Mean for Retrieval in Practice?
IBM’s Granite Embedding R2 models demonstrate that effective embedding systems do not require massive parameter counts. They offer long-context support, benchmark-leading accuracy, and high throughput in compact architectures. For companies developing retrieval pipelines, knowledge management systems, or RAG workflows, Granite R2 presents a production-ready, commercially viable alternative to existing open-source options.
Summary
IBM’s Granite Embedding R2 models balance compact design, long-context capability, and strong retrieval performance. With optimized throughput in both GPU and CPU environments, and an Apache 2.0 license enabling unrestricted commercial use, they offer a practical alternative to bulkier open-source embeddings. For enterprises deploying RAG, search, or large-scale knowledge systems, Granite R2 stands out as an efficient, production-ready option.
Check out the Paper, granite-embedding-small-english-r2, and granite-embedding-english-r2.