
Apple Released FastVLM: A Novel Hybrid Vision Encoder which is 85x Faster and 3.4x Smaller than Comparable Sized Vision Language Models (VLMs)


Introduction

Vision Language Models (VLMs) facilitate both text inputs and visual understanding, with image resolution being a crucial factor affecting performance when processing text and chart-rich data. High image resolution introduces several challenges:

  • Pretrained vision encoders typically handle high-resolution images poorly, since pretraining them at such resolutions is inefficient.
  • High-resolution images result in increased computational costs and latency during visual token generation, whether approached through single high-resolution processing or multiple lower-resolution tile strategies.
  • Generating more tokens from high-resolution images inflates the LLM prefilling time and hence the time-to-first-token (TTFT), which comprises the vision encoder’s latency plus the LLM’s prefilling time (a rough cost sketch follows this list).
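
To make the resolution-to-latency relationship concrete, here is a rough back-of-envelope sketch, not taken from the paper: the patch size and per-token costs are illustrative assumptions, and it only shows how a plain ViT-style encoder's visual token count, and therefore TTFT, grows with input resolution.

```python
# Back-of-envelope illustration (not FastVLM's actual numbers): how the visual
# token count grows with input resolution for a ViT-style encoder, and how
# time-to-first-token (TTFT) decomposes into encoder latency + LLM prefill.

def visual_token_count(resolution: int, patch_size: int = 14) -> int:
    """Tokens produced by a plain ViT encoder: one per image patch."""
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side

def ttft_ms(resolution: int, encoder_ms_per_token: float = 0.05,
            prefill_ms_per_token: float = 0.2) -> float:
    """TTFT = vision-encoder latency + LLM prefill over the visual tokens.
    The per-token costs here are made-up placeholders for illustration."""
    tokens = visual_token_count(resolution)
    return tokens * encoder_ms_per_token + tokens * prefill_ms_per_token

for res in (336, 672, 1024):
    print(res, visual_token_count(res), round(ttft_ms(res), 1))
# Quadrupling the pixel count roughly quadruples the visual tokens,
# so both encoding and prefill costs grow quickly with resolution.
```

Under these toy assumptions, moving from 336 to 1024 pixels per side multiplies the token count (and hence both cost terms) by roughly nine, which is why reducing tokens at high resolution is the central lever FastVLM targets.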

Existing VLM Architectures

Current multimodal models such as Frozen and Florence use cross-attention to integrate image and text embeddings within the LLM’s intermediate layers. Effective auto-regressive architectures include LLaVA, mPLUG-Owl, MiniGPT-4, and Cambrian-1. For efficient image encoding, CLIP-pretrained vision transformers are widely used, with variants such as SigLIP, EVA-CLIP, InternViT, and DFNCLIP. Techniques like LLaVA-PruMerge and Matryoshka-based token sampling perform dynamic token pruning, while hierarchical backbones such as ConvNeXT and FastViT reduce token count through progressive downsampling. ConvLLaVA goes further, using a pure-convolutional vision encoder for VLM image encoding.

Apple’s FastVLM

Researchers from Apple have introduced FastVLM, a model that achieves an optimized trade-off among resolution, latency, and accuracy. The design is grounded in an analysis of how image resolution, processing latency, and visual token count interact. FastVLM employs FastViTHD, a hybrid vision encoder designed to emit fewer tokens while sharply reducing encoding time for high-resolution images.

Key features of FastVLM include:

  • A 3.2 times reduction in time-to-first-token (TTFT) within the LLaVA-1.5 framework.
  • Superior benchmark performance relative to LLaVA-OneVision at its maximum resolution while using the same 0.5B LLM.
  • An 85 times faster TTFT in that comparison, achieved with a vision encoder that is 3.4 times smaller.

All FastVLM models are trained on a single node with 8 NVIDIA H100-80GB GPUs, and the initial training stage completes in roughly 30 minutes when a Qwen2-7B decoder is used.

FastViTHD extends the base FastViT architecture with an additional stage containing a downsampling layer, so that self-attention operates on tensors downsampled by a factor of 32 rather than 16. This reduces image-encoding latency while producing four times fewer tokens for the LLM decoder. FastViTHD comprises five stages: the first three use RepMixer blocks, while the final two use multi-headed self-attention blocks, striking a balance between computational efficiency and high-resolution image understanding.
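The description above can be summarized in a schematic PyTorch sketch. This is not Apple's released implementation; the channel widths, block counts, and the simplified RepMixer stand-in are assumptions made purely for illustration. It only captures the structural idea: three convolutional stages followed by two self-attention stages, with enough downsampling that attention runs on a /32 grid and comparatively few visual tokens are handed to the LLM decoder.

```python
import torch
import torch.nn as nn

# Schematic sketch (not Apple's released code) of a five-stage hybrid encoder in
# the spirit of FastViTHD: three convolutional, RepMixer-style stages followed by
# two self-attention stages. The extra downsampling means self-attention runs on
# a /32 feature grid rather than /16, so far fewer visual tokens reach the LLM.
# Channel widths and block counts here are illustrative placeholders.

class ConvMixerStage(nn.Module):
    """Stand-in for a RepMixer stage: stride-2 downsample + depthwise mixing."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.down = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
        self.mix = nn.Sequential(
            nn.Conv2d(c_out, c_out, 3, padding=1, groups=c_out),  # spatial mixing
            nn.Conv2d(c_out, c_out, 1),                           # channel mixing
            nn.GELU(),
        )

    def forward(self, x):
        x = self.down(x)
        return x + self.mix(x)


class AttentionStage(nn.Module):
    """Multi-head self-attention over the (optionally downsampled) feature grid."""
    def __init__(self, c_in, c_out, heads=8, downsample=True):
        super().__init__()
        stride = 2 if downsample else 1
        self.down = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1)
        self.norm = nn.LayerNorm(c_out)
        self.attn = nn.MultiheadAttention(c_out, heads, batch_first=True)

    def forward(self, x):
        x = self.down(x)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
        normed = self.norm(seq)
        seq = seq + self.attn(normed, normed, normed, need_weights=False)[0]
        return seq.transpose(1, 2).reshape(b, c, h, w)


class HybridEncoderSketch(nn.Module):
    """Stem (/2) + three conv stages (/4, /8, /16) + two attention stages (/32)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 3, stride=2, padding=1)
        self.stages = nn.Sequential(
            ConvMixerStage(64, 128),                     # /4
            ConvMixerStage(128, 256),                    # /8
            ConvMixerStage(256, 512),                    # /16
            AttentionStage(512, 768, downsample=True),   # /32
            AttentionStage(768, 1024, downsample=False), # stays at /32
        )

    def forward(self, x):
        feats = self.stages(self.stem(x))        # (B, C, H/32, W/32)
        return feats.flatten(2).transpose(1, 2)  # visual tokens for the LLM


if __name__ == "__main__":
    tokens = HybridEncoderSketch()(torch.randn(1, 3, 1024, 1024))
    print(tokens.shape)  # (1, 1024, 1024): a 32x32 grid of tokens at /32
```

Running the sketch on a 1024x1024 input yields a 32x32 grid of 1,024 visual tokens, versus the roughly 4,096 tokens a /16 grid would produce, which illustrates the four-fold token reduction described above.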

Benchmark Comparisons

In comparative evaluations against ConvLLaVA with an equivalent LLM and similar training data, FastVLM delivers an 8.4% improvement on TextVQA and a 12.5% improvement on DocVQA while running 22% faster. The advantage widens at higher resolutions, where FastVLM remains twice as fast as ConvLLaVA across various benchmarks. FastVLM matches or exceeds MM1 performance across a diverse range of benchmarks while producing 5 times fewer visual tokens, aided by intermediate pretraining on 15 million samples for better resolution scaling. It also surpasses Cambrian-1 while operating 7.9 times faster, and with optimized instruction tuning it delivers superior results while generating 2.3 times fewer visual tokens.

Conclusion

In summary, FastVLM represents a significant advancement in Vision Language Models through its FastViTHD vision backbone, which enables efficient high-resolution image encoding. The hybrid architecture, pretrained on reinforced image-text datasets, reduces visual token output with minimal accuracy loss compared to existing approaches. FastVLM achieves competitive performance across various VLM benchmarks while offering notable improvements in both time-to-first-token and the parameter count of its vision backbone.

For more detailed insights, please review the paper and explore the model on Hugging Face. All credit for this research is due to the contributing researchers of this project.
