Liquid AI Releases LFM2-8B-A1B: An On-Device Mixture-of-Experts with 8.3B Total and ~1.5B Active Params per Token
Understanding the Target Audience
The primary audience for the LFM2-8B-A1B release includes:
- Business Decision Makers: They seek to integrate advanced AI capabilities into products while managing costs, performance, and system requirements.
- Data Scientists and AI Engineers: They are interested in efficient architectures, scalability, and real-world use cases for on-device AI applications.
- Software Developers: They require clear technical specifications for implementation on consumer-grade hardware.
- Researchers: They are focused on the underlying architecture and performance benchmarks to further AI advancements.
Common pain points include existing models' limits around device compatibility, latency, and real-time processing. Their goals center on leveraging AI for better user experiences, lower operational costs, and stronger data privacy.
Overview of LFM2-8B-A1B Architecture
The LFM2-8B-A1B model is designed for on-device execution, specifically targeting phones, laptops, and embedded systems. With 8.3B total parameters, the model activates approximately 1.5B parameters per token through sparse expert routing, keeping per-token compute low while retaining the representational capacity of the full parameter set.
Key technical specifications include:
- Total Parameters: 8.3B
- Active Parameters per Token: ~1.5B
- Context Length: 32,768 tokens
- Vocabulary Size: 65,536
- Pre-training Budget: ~12T tokens
The architecture pairs the LFM2 ‘fast backbone’ of 18 gated short-convolution blocks and 6 grouped-query attention (GQA) blocks with sparse MoE feed-forward blocks: 32 experts per block, with the top-4 experts selected per token, adding capacity without significantly increasing per-token compute.
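To make the routing concrete, here is a minimal PyTorch sketch of top-4-of-32 expert selection. Layer sizes, activation choice, and class names are illustrative assumptions for this post, not Liquid AI's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    """Top-k expert routing: every token runs only k of the num_experts FFNs."""

    def __init__(self, d_model=2048, d_ff=4096, num_experts=32, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network (sizes are placeholders).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (num_tokens, d_model)
        logits = self.router(x)                    # (num_tokens, num_experts)
        weights, expert_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (expert_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            # Only tokens routed to expert e pay for its FFN, so per-token compute
            # stays near top_k / num_experts of the dense-equivalent cost.
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

moe = SparseMoEBlock()
tokens = torch.randn(10, 2048)
print(moe(tokens).shape)   # torch.Size([10, 2048])
```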
Performance Insights
Liquid AI's testing indicates that LFM2-8B-A1B decodes significantly faster than Qwen3-1.7B in CPU tests across the measured configurations, while maintaining quality comparable to 3B-4B dense models with active compute near 1.5B parameters per token.
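As a rough back-of-the-envelope illustration of the memory-versus-compute trade-off (the 2-FLOPs-per-parameter rule of thumb and the 4-bit quantization figure are assumptions for this sketch, not vendor numbers):

```python
total_params  = 8.3e9   # all experts must stay resident in memory
active_params = 1.5e9   # parameters actually used per decoded token

# Decode compute scales with active params (~2 FLOPs per parameter per token),
# so the model decodes roughly like a 1.5B dense model...
flops_per_token = 2 * active_params            # ~3.0e9 FLOPs

# ...while weight memory scales with total params. At ~4.5 bits/weight
# (a typical 4-bit GGUF quantization, assumed here):
weight_memory_gb = total_params * 4.5 / 8 / 1e9

print(f"~{flops_per_token / 1e9:.1f} GFLOPs per token, ~{weight_memory_gb:.1f} GB of weights")
# -> ~3.0 GFLOPs per token, ~4.7 GB of weights
```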
On accuracy, the model shows competitive results across 16 benchmarks, including:
- MMLU/MMLU-Pro/GPQA (knowledge)
- IFEval/IFBench/Multi-IF (instruction following)
- GSM8K/GSMPlus/MATH500/MATH-Lvl-5 (math)
- MGSM/MMMLU (multilingual)
Deployment and Tooling
LFM2-8B-A1B is compatible with various tools for deployment:
- Utilizes Transformers/vLLM for GPU inference (a minimal sketch follows this list)
- Built for local runs via GGUF, with quantized variants that fit comfortably on consumer hardware
- Recommended deployment platforms include llama.cpp and ExecuTorch for mobile applications
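For GPU inference with Transformers, a minimal sketch is shown below. The checkpoint id is assumed from the release name (verify the exact repo on Hugging Face), and the generation settings are illustrative rather than official recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-8B-A1B"   # assumed repo id; check the Hugging Face page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,      # 8.3B weights fit on a single high-memory GPU in bf16
    device_map="auto",               # may require a recent transformers release
)

messages = [{"role": "user", "content": "Summarize the benefits of on-device MoE models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The quantized GGUF variants mentioned above are intended for local runs through llama.cpp or similar runtimes rather than this GPU path.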
Key Takeaways
The LFM2-8B-A1B model demonstrates a practical application of sparse MoE technology for managing memory and latency on consumer devices: the LFM2 convolution-plus-attention backbone combined with top-4 expert routing delivers quality comparable to 3B-4B dense models at roughly 1.5B active parameters, suiting applications such as private assistants and embedded systems.
For further exploration, access the model on Hugging Face and refer to the technical details provided by Liquid AI. Additionally, visit their GitHub page for tutorials, code, and notebooks.