
Liquid AI’s LFM2-VL-3B Brings a 3B Parameter Vision Language Model (VLM) to Edge-Class Devices

Understanding the Target Audience for Liquid AI’s LFM2-VL-3B

The target audience for Liquid AI’s LFM2-VL-3B primarily includes developers, data scientists, and business leaders in robotics, mobile technology, and industrial applications. These professionals integrate advanced AI into their products and services, and their pain points typically center on efficient processing on constrained hardware, cost-effective deployment, and preserving data privacy through local processing.

Key goals for this audience include:

  • Achieving high accuracy in multimodal AI tasks.
  • Ensuring fast processing speeds to meet real-time application requirements.
  • Integrating AI models seamlessly into existing workflows and systems.

Interests include advancements in AI technology, particularly in vision-language models, and the practical applications of these technologies in their respective fields. Communication preferences lean towards technical documentation, detailed specifications, and peer-reviewed research to support decision-making.

Overview of Liquid AI’s LFM2-VL-3B

Liquid AI has introduced LFM2-VL-3B, a 3B-parameter vision language model designed for image-text-to-text tasks. The release extends the LFM2-VL family, which previously included 450M and 1.6B variants, targeting higher accuracy while maintaining the speed profile of the LFM2 architecture. The model is available on LEAP and Hugging Face under the LFM Open License v1.0.

Model Architecture and Interface

LFM2-VL-3B pairs a language tower with a shape-aware vision tower, linked by a projector. The language tower is based on LFM2-2.6B, which uses a hybrid convolution-plus-attention backbone. The vision tower is a 400M-parameter SigLIP2 NaFlex encoder that preserves native aspect ratios to avoid distortion. A two-layer MLP connector compresses image tokens before merging them into the language space, letting users manage vision token budgets without retraining the model.

The encoder can process native resolutions up to 512×512. Larger images are divided into non-overlapping 512×512 patches, with a thumbnail pathway providing global context during tiling. Documented token mappings show that a 256×384 image translates to 96 tokens, while a 1000×3000 image corresponds to 1,020 tokens. User controls for minimum and maximum image tokens and tiling options are available to optimize speed and quality during inference.
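To make the tiling arithmetic concrete, here is a minimal sketch that counts the non-overlapping 512×512 patches (plus the global thumbnail) a given resolution would produce. The helper is purely illustrative: the actual image-token counts, such as 96 tokens for a 256×384 image, come from the model's processor, not from this function.

```python
import math

def estimate_tiles(width: int, height: int, tile: int = 512) -> int:
    """Illustrative sketch: count 512x512 patches for an input image.

    Images at or below 512x512 are encoded natively in a single pass;
    larger images are split into ceil(w/512) * ceil(h/512) non-overlapping
    patches plus a thumbnail that provides global context. Actual token
    counts depend on the LFM2-VL processor, not on this helper.
    """
    if width <= tile and height <= tile:
        return 1  # fits natively, no tiling needed
    patches = math.ceil(width / tile) * math.ceil(height / tile)
    return patches + 1  # +1 for the global thumbnail pathway

# Resolutions mentioned in the documentation:
print(estimate_tiles(256, 384))    # 1 -> encoded natively (96 tokens per the model card)
print(estimate_tiles(1000, 3000))  # 2 * 6 patches + thumbnail = 13 passes through the encoder
```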

Inference Settings

The Hugging Face model card outlines recommended inference parameters. Text generation settings include a temperature of 0.1, a min_p of 0.15, and a repetition penalty of 1.05. Vision settings recommend a minimum of 64 image tokens and a maximum of 256, with image splitting enabled. The processor applies the chat template and image sentinel automatically, and the example implementation uses AutoModelForImageTextToText and AutoProcessor with bfloat16 precision, as in the sketch below.
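A minimal loading-and-generation sketch under those settings follows. It assumes the Hugging Face repository id LiquidAI/LFM2-VL-3B, a hypothetical local image file, and the standard Transformers image-text-to-text pattern; the vision-side keyword arguments (min_image_tokens, max_image_tokens, do_image_splitting) mirror the model card's description and should be verified against the released processor.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-3B"  # assumed Hugging Face repo id

# Load model and processor in bfloat16, as described in the model card.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("invoice.png")  # placeholder local image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this document."},
        ],
    }
]

# The processor applies the chat template and image sentinel automatically.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    # Vision settings from the model card; exact kwarg names are assumptions.
    min_image_tokens=64,
    max_image_tokens=256,
    do_image_splitting=True,
).to(model.device)

# Recommended text settings: temperature 0.1, min_p 0.15, repetition penalty 1.05.
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```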

Training Methodology

Liquid AI employs a staged training approach, beginning with joint mid-training that adjusts the text-to-image ratio over time. This is followed by supervised fine-tuning focused on image understanding. The training data consists of large-scale open datasets and in-house synthetic vision data to ensure comprehensive task coverage.

Performance Benchmarks

According to the research team, the LFM2-VL-3B achieves competitive results among lightweight open vision language models. Performance metrics include:

  • MM-IFEval: 51.83
  • RealWorldQA: 71.37
  • MMBench dev en: 79.81
  • POPE score: 89.01

Language capabilities track the LFM2-2.6B backbone, with roughly 30% on GPQA and 63% on MMLU, which matters for workloads that mix perception with knowledge queries. The model also supports expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.

Why Edge Users Should Care

The architecture of LFM2-VL-3B is designed to keep compute and memory within the limits of small devices. Compressible image tokens and user-configurable token limits make throughput predictable. The 400M SigLIP2 NaFlex encoder preserves aspect ratios, which helps fine-grained perception, and the projector reduces token counts at the connector, improving tokens per second. The research team has also released a GGUF build for on-device runtimes, which makes the model especially relevant for robotics, mobile, and industrial clients that require local processing and strict data boundaries.

Key Takeaways

  • Compact multimodal stack: The 3B parameter LFM2-VL-3B combines an LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a two-layer MLP projector for efficient image-token fusion.
  • Resolution handling and token budgets: The model processes images natively up to 512×512, with larger inputs tiled into non-overlapping patches. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
  • Inference interface: The model uses ChatML-like prompting with an image sentinel token, a default text context of 32,768 tokens, and processor-level controls for image splitting, which facilitates reproducible evaluation and integration into multimodal pipelines.
  • Measured performance: The reported scores on MM-IFEval, RealWorldQA, MMBench, and POPE are competitive for this model size, with language-only signals from the backbone showing 30% GPQA and 63% MMLU, beneficial for mixed perception and knowledge tasks.

The LFM2-VL-3B represents a significant advancement for edge multimodal workloads, combining efficiency with robust performance metrics. Open weights, a GGUF build, and access via LEAP minimize integration challenges, making this model a practical choice for businesses seeking to leverage AI in their operations.

For more technical details, check out the model on Hugging Face. You can also explore our GitHub Page for tutorials, code, and notebooks.