
IBM Releases New Granite 4.0 Models with a Novel Hybrid Mamba-2/Transformer Architecture, Drastically Reducing Memory Use Without Sacrificing Performance


IBM has introduced Granite 4.0, an open-source family of large language models (LLMs) that utilizes a hybrid Mamba-2/Transformer architecture. This innovative design significantly reduces memory usage while maintaining performance quality. The models include:

  • Granite-4.0-H-Small: 32B total, ~9B active (hybrid MoE)
  • Granite-4.0-H-Tiny: 7B total, ~1B active (hybrid MoE)
  • Granite-4.0-H-Micro: 3B (hybrid dense)
  • Granite-4.0-Micro: 3B (dense Transformer for stacks that don’t yet support hybrids)

All models are released under the Apache-2.0 license and are cryptographically signed. Notably, Granite 4.0 is the first open model family to receive accredited ISO/IEC 42001:2023 AI management system certification.

Key Features and Technical Specifications

Granite 4.0 employs a hybrid design that interleaves Mamba-2 state-space layers with self-attention blocks at roughly a 9:1 ratio, i.e., about nine Mamba-2 layers for every attention layer. According to IBM’s technical blog, this architecture can reduce RAM usage by over 70% for long-context and multi-session inference, which translates into lower GPU costs at a given throughput and latency target. Internal comparisons indicate that the smallest Granite 4.0 models outperform the previous Granite 3.3-8B models despite having fewer parameters.
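The interleaving pattern can be sketched as a simple layer schedule. This is an illustration only, assuming a hypothetical 40-layer stack with every tenth block being self-attention; IBM has not published the exact placement used in Granite 4.0.

```python
# Sketch of a 9:1 hybrid layer schedule: `ratio` Mamba-2 state-space layers
# per self-attention layer. Layer count and placement are assumptions for
# illustration, not Granite 4.0's actual configuration.

def hybrid_schedule(num_layers: int, ratio: int = 9) -> list[str]:
    """Return a layer-type list with `ratio` Mamba-2 blocks per attention block."""
    period = ratio + 1  # one attention block every (ratio + 1) layers
    return [
        "attention" if (i + 1) % period == 0 else "mamba2"
        for i in range(num_layers)
    ]

schedule = hybrid_schedule(40)
print(schedule.count("mamba2"), schedule.count("attention"))  # 36 4
```

Because the vast majority of layers are state-space blocks with constant-size recurrent state, only the sparse attention layers accumulate a KV cache that grows with context length.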

Training and Performance

Granite 4.0 was trained on samples up to 512K tokens and evaluated on up to 128K tokens. Public checkpoints on Hugging Face are available in BF16 format, with quantized and GGUF conversions also published. FP8 is an execution option on supported hardware, although it is not the format of the released weights.
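A back-of-envelope estimate shows why a mostly-Mamba stack saves memory at long context: a Transformer's KV cache grows linearly with sequence length, while Mamba-2 layers keep a fixed-size state. All dimensions below (layer count, hidden size, per-layer state size) are hypothetical round numbers, not Granite's published configuration.

```python
# Rough KV-cache memory comparison under assumed (hypothetical) dimensions:
# 40 layers, hidden size 4096, bf16 values, 128K-token context.

BYTES = 2            # bf16
LAYERS = 40
HIDDEN = 4096
CONTEXT = 128_000    # tokens

def kv_cache_bytes(layers: int, context: int) -> int:
    # two tensors (K and V) of shape [context, hidden] per attention layer
    return layers * 2 * context * HIDDEN * BYTES

# Pure Transformer: every layer holds a growing KV cache.
full_attn = kv_cache_bytes(LAYERS, CONTEXT)

# 9:1 hybrid: only 1 in 10 layers holds a KV cache; each Mamba-2 layer keeps
# a constant state, assumed here at 1 MiB (an order-of-magnitude guess).
hybrid = kv_cache_bytes(LAYERS // 10, CONTEXT) + (LAYERS - LAYERS // 10) * (1 << 20)

print(f"full attention: {full_attn / 2**30:.1f} GiB")
print(f"hybrid (9:1):   {hybrid / 2**30:.1f} GiB")
print(f"reduction:      {1 - hybrid / full_attn:.0%}")
```

Under these toy numbers the per-session cache shrinks by well over the 70% IBM cites, and the savings compound across concurrent sessions since each one needs its own cache.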

IBM highlights several performance benchmarks relevant to enterprise applications:

  • IFEval (HELM): Granite-4.0-H-Small leads most open-weights models, trailing only the much larger Llama 4 Maverick.
  • BFCLv3 (Function Calling): H-Small is competitive with larger open and closed models at lower price points.
  • MTRAG (multi-turn RAG): Improved reliability on complex retrieval workflows.

Accessing Granite 4.0

Granite 4.0 is available on IBM watsonx.ai and can be accessed through various platforms including Dell Pro AI Studio/Enterprise Hub, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, and Replicate. IBM is also enabling support for vLLM, llama.cpp, NexaML, and MLX for hybrid serving.

Conclusion

The Granite 4.0 models, with their hybrid Mamba-2/Transformer stack and active-parameter MoE, present a practical approach to reducing total cost of ownership (TCO). The significant memory reduction and long-context throughput gains allow for smaller GPU fleets without compromising instruction-following or tool-use accuracy. The availability of BF16 checkpoints with GGUF conversions simplifies local evaluation pipelines, while the ISO/IEC 42001 certification and signed artifacts address compliance and provenance concerns that often hinder enterprise deployment. Overall, Granite 4.0 offers a lean, auditable model family that is easier to integrate into production environments compared to previous 8B-class Transformers.

For more technical details, check out the IBM announcement.

Explore the Hugging Face Model Card and visit our GitHub Page for tutorials, code, and notebooks.