DeepSeek Just Released a 3B OCR Model: A 3B VLM Designed for High-Performance OCR and Structured Document Conversion
DeepSeek-AI has released DeepSeek-OCR, a 3B end-to-end OCR and document parsing Vision-Language Model (VLM). The system compresses long text into a small set of vision tokens and decodes those tokens with a language model. The approach exploits the fact that text rendered as an image can be represented compactly, significantly reducing the sequence length the decoder must process. On the Fox benchmark, the research team reports about 97% decoding precision when the number of text tokens is within 10 times the number of vision tokens, and still-useful performance even near 20× compression. It also shows competitive results on OmniDocBench while using far fewer tokens than common baselines.
Architecture: What is Actually New?
DeepSeek-OCR-3B consists of two main components: a vision encoder called DeepEncoder and a Mixture of Experts decoder named DeepSeek3B-MoE-A570M. The encoder is optimized for high-resolution inputs with low activation costs and a minimal output token count. It employs a window attention stage based on SAM for local perception, followed by a two-layer convolutional compressor for 16× token downsampling, and a dense global attention stage based on CLIP for visual knowledge aggregation. This design effectively manages activation memory at high resolutions while keeping the vision token count low. The decoder is a 3B parameter MoE model with approximately 570M active parameters per token.
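To make the token path concrete, here is a minimal structural sketch of the encoder flow in PyTorch. This is not the released implementation: the layer sizes, patch size, and the plain transformer layers standing in for the SAM-style windowed stage and CLIP-style global stage are illustrative assumptions. Only the overall shape follows the description above: a local perception stage, a two-layer convolutional compressor that cuts the token count 16×, then a dense global stage whose output feeds the decoder.

```python
# Structural sketch of the DeepEncoder token path (illustrative, not the release).
import torch
import torch.nn as nn

class ConvCompressor(nn.Module):
    """Two stride-2 convolutions: 4x fewer tokens per axis, 16x fewer overall."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):          # x: (B, dim, H, W) patch feature map
        return self.net(x)         # -> (B, dim, H/4, W/4)

class DeepEncoderSketch(nn.Module):
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Stand-ins only: the real encoder uses SAM-style *windowed* attention
        # locally and a CLIP-style global stage; generic layers keep the sketch short.
        self.local_stage = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.global_stage = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.compressor = ConvCompressor(dim)

    def forward(self, pixels):                              # (B, 3, 1024, 1024), Base mode
        x = self.patch_embed(pixels)                        # (B, dim, 64, 64) = 4096 patches
        b, d, h, w = x.shape
        x = self.local_stage(x.flatten(2).transpose(1, 2))  # local perception pass
        x = x.transpose(1, 2).reshape(b, d, h, w)
        x = self.compressor(x)                              # 16x token downsampling
        x = x.flatten(2).transpose(1, 2)                    # (B, 256, dim) vision tokens
        return self.global_stage(x)                         # dense global aggregation

vision_tokens = DeepEncoderSketch()(torch.randn(1, 3, 1024, 1024))
print(vision_tokens.shape)   # torch.Size([1, 256, 768]) -> Base mode's 256 vision tokens
```

The details differ in the real encoder, but the key point is visible in the shapes: a 1024 by 1024 page enters as 4096 patches and leaves the compressor as only 256 vision tokens for the decoder.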
Multi-Resolution Modes: Engineered for Token Budgets
DeepEncoder supports both native and dynamic modes. Native modes include:
- Tiny: 64 tokens at 512 by 512 pixels
- Small: 100 tokens at 640 by 640 pixels
- Base: 256 tokens at 1024 by 1024 pixels
- Large: 400 tokens at 1280 by 1280 pixels
Dynamic modes, named Gundam and Gundam-Master, mix tiled local views with a global view, allowing for flexible token budgeting based on page complexity.
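One way to reason about these budgets is to pick the smallest mode that keeps the text-to-vision token ratio at or below roughly 10×, where the reported precision stays high. The helper below is a hypothetical selection heuristic, not part of the release; only the mode table mirrors the figures above.

```python
# Hypothetical mode-selection helper; the (vision_tokens, resolution) table
# matches the native modes listed above, the heuristic itself is an assumption.
NATIVE_MODES = {
    "tiny":  {"vision_tokens": 64,  "resolution": (512, 512)},
    "small": {"vision_tokens": 100, "resolution": (640, 640)},
    "base":  {"vision_tokens": 256, "resolution": (1024, 1024)},
    "large": {"vision_tokens": 400, "resolution": (1280, 1280)},
}

def pick_mode(expected_text_tokens: int, max_ratio: float = 10.0) -> str:
    """Return the smallest native mode keeping text/vision token ratio under max_ratio."""
    for name, cfg in NATIVE_MODES.items():  # iterates from smallest to largest budget
        if expected_text_tokens / cfg["vision_tokens"] <= max_ratio:
            return name
    return "gundam"  # very dense pages: fall back to a dynamic tiled mode

print(pick_mode(700))    # 'small'  (700 / 100 = 7x)
print(pick_mode(3000))   # 'large'  (3000 / 400 = 7.5x)
print(pick_mode(9000))   # 'gundam'
```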
Compression Results: What the Numbers Say
The Fox benchmark study measures precision as exact text match after decoding. With 100 vision tokens, pages containing 600 to 700 text tokens achieve 98.5% precision at 6.7× compression, and pages with 900 to 1000 text tokens reach 96.8% precision at 9.7× compression. With 64 vision tokens, precision falls as compression rises, for example 59.1% at roughly 19.7× compression on pages with 1200 to 1300 text tokens.
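The compression ratio in these figures is simply the number of ground-truth text tokens divided by the number of vision tokens, which is easy to check against the reported points (the midpoints of the text-token ranges below are mine, used only for illustration).

```python
# Quick check of the reported compression ratios (text tokens / vision tokens).
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

print(round(compression_ratio(670, 100), 1))   # ~6.7x, 98.5% precision reported
print(round(compression_ratio(970, 100), 1))   # ~9.7x, 96.8% precision reported
print(round(compression_ratio(1260, 64), 1))   # ~19.7x, 59.1% precision reported
```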
On OmniDocBench, DeepSeek-OCR outperforms GOT-OCR 2.0 while using only 100 vision tokens per page, and with fewer than 800 vision tokens it surpasses MinerU 2.0, which uses more than 6000 tokens per page on average.
Training Details That Matter
The research team outlines a two-phase training pipeline. Initially, DeepEncoder is trained with next token prediction on OCR 1.0 and OCR 2.0 data, along with 100M LAION samples. The complete system is then trained using pipeline parallelism across 4 partitions. The training utilized 20 nodes, each equipped with 8 A100 40G GPUs, employing the AdamW optimizer. The team reports a training speed of 90B tokens per day on text-only data and 70B tokens per day on multimodal data. In production, it generates over 200k pages per day on a single A100 40G node.
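For a rough sense of scale, the reported figures translate into the per-GPU and per-node rates below; the divisions are mine, not from the paper.

```python
# Back-of-the-envelope throughput derived from the reported setup:
# 20 nodes x 8 A100 40G = 160 GPUs, with daily token and page counts as stated.
gpus = 20 * 8
text_tokens_per_day = 90e9
multimodal_tokens_per_day = 70e9
pages_per_day_single_node = 200_000

print(f"{text_tokens_per_day / gpus / 1e6:.0f}M text tokens per GPU per day")        # ~562M
print(f"{multimodal_tokens_per_day / gpus / 1e6:.0f}M multimodal tokens per GPU per day")  # ~438M
print(f"{pages_per_day_single_node / 86400:.1f} pages per second on one node")       # ~2.3
```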
How to Evaluate It in a Practical Stack
For typical reports or books, start with the Small mode at 100 tokens, adjusting upward only if the edit distance is unacceptable. For pages with dense small fonts or high token counts, consider using a Gundam mode, which combines global and local fields of view with explicit token budgeting. If your workload includes charts, tables, or chemical structures, refer to the “Deep parsing” qualitative section for examples of converting to HTML tables and SMILES, ensuring outputs are easy to validate.
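A minimal way to try this is the Hugging Face trust_remote_code path with the pinned versions from the model card. The snippet below is a sketch under those assumptions: the custom `infer` helper and its arguments such as `base_size`, `image_size`, and `crop_mode` are taken from the model card and should be verified against the exact release you install, and the file paths are placeholders.

```python
# Minimal inference sketch following the Hugging Face model card's
# trust_remote_code path; verify the `infer` signature against the card.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="flash_attention_2",  # matches the tested Flash Attention setup
)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\n<|grounding|>Convert the document to markdown."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="page.png",   # placeholder input page
    output_path="ocr_out/",  # placeholder output directory
    base_size=1024,          # global view resolution
    image_size=640,          # Small-mode view (100 vision tokens)
    crop_mode=True,          # tiled (Gundam-style) cropping for dense pages
    save_results=True,
)
```

Starting from the Small-mode settings and escalating only when the edit distance on your own pages is unacceptable keeps the token budget, and therefore cost, predictable.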
Key Takeaways
- DeepSeek OCR targets token efficiency using optical context compression with near lossless decoding at about 10 times compression and around 60% precision at about 20 times compression.
- The HF release exposes explicit token budgets: Tiny uses 64 tokens at 512 by 512, Small uses 100 tokens at 640 by 640, Base uses 256 tokens at 1024 by 1024, and Large uses 400 tokens at 1280 by 1280.
- The system comprises a DeepEncoder that compresses pages into vision tokens and a DeepSeek3B MoE decoder with approximately 570M active parameters.
- The Hugging Face model card documents a tested setup for immediate use: Python 3.12.9, CUDA 11.8, PyTorch 2.6.0, Transformers 4.46.3, Tokenizers 0.20.3, and Flash Attention 2.7.3.
Conclusion
DeepSeek-OCR is a notable step for document AI: it treats pages as compact optical carriers that cut decoder sequence length without discarding critical information. The reported 97% decoding precision at roughly 10× compression on the Fox benchmark is the key claim to validate on your own workloads. The model is packaged for Transformers with a tested environment, which lowers integration cost for engineers. In short, DeepSeek-OCR operationalizes optical context compression with a 3B MoE decoder and explicit token-budget modes.
Check out the Technical Paper and the Model on GitHub.