Understanding the Target Audience for PaddleOCR-VL
The target audience for PaddleOCR-VL (0.9B), released by Baidu’s PaddlePaddle team, primarily includes:
- Data Scientists and Machine Learning Engineers: These professionals are interested in advanced tools for document parsing and are likely to seek solutions that enhance data extraction accuracy and efficiency.
- Business Analysts: They focus on how AI can streamline business processes, particularly in handling multilingual documents and complex layouts.
- Software Developers: Developers looking to integrate document parsing capabilities into applications will find PaddleOCR-VL’s structured outputs appealing.
- Researchers: Individuals conducting studies in AI, machine learning, and natural language processing will be interested in the technical aspects and performance benchmarks.
The pain points of this audience typically include:
- Challenges in accurately parsing complex documents with varied layouts and languages.
- Concerns regarding inference latency and memory usage in real-world applications.
- The need for tools that support multiple languages, including less common scripts.
Their goals include:
- Improving the accuracy and speed of document processing.
- Integrating advanced AI capabilities into existing workflows.
- Staying updated on the latest advancements in AI and document parsing technologies.
Interests often revolve around:
- New methodologies in machine learning and document processing.
- Case studies demonstrating successful AI implementations.
- Technical discussions and peer-reviewed research findings.
Preferred communication methods include:
- Technical reports and whitepapers for in-depth understanding.
- Webinars and workshops for hands-on learning.
- Online forums and community discussions for peer interaction.
PaddleOCR-VL Overview
Baidu’s PaddlePaddle team has introduced PaddleOCR-VL, a 0.9B-parameter vision-language model designed for efficient end-to-end document parsing. This model is capable of handling a variety of content types, including text, tables, formulas, charts, and handwriting, producing structured outputs in Markdown and JSON formats.
The architecture of PaddleOCR-VL integrates a NaViT-style (native-resolution ViT) dynamic-resolution vision encoder with an ERNIE-4.5-0.3B language decoder. The model supports 109 languages and is designed to process complex layouts and small scripts effectively.
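To make the "structured Markdown/JSON output" claim concrete, here is a minimal, purely illustrative sketch of what element-level parser output might look like and how it could be rendered into both formats. The field names and schema below are assumptions for illustration, not the actual PaddleOCR-VL output format.

```python
# Hypothetical example: structured document-parsing output rendered as Markdown and JSON.
# The schema (index, kind, bbox, content) is an assumption, not PaddleOCR-VL's real format.
import json
from dataclasses import dataclass, asdict

@dataclass
class DocElement:
    index: int     # reading-order position on the page
    kind: str      # e.g. "text", "table", "formula", "chart"
    bbox: list     # [x0, y0, x1, y1] in page coordinates
    content: str   # recognized content as a Markdown fragment

def to_markdown(elements):
    # Concatenate recognized content in reading order.
    return "\n\n".join(e.content for e in sorted(elements, key=lambda e: e.index))

elements = [
    DocElement(0, "text", [40, 30, 560, 80], "# Quarterly Report"),
    DocElement(1, "table", [40, 100, 560, 300], "| Region | Revenue |\n|---|---|\n| EU | 1.2M |"),
]

print(to_markdown(elements))                        # Markdown view
print(json.dumps([asdict(e) for e in elements]))    # JSON view
```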
System Design
PaddleOCR-VL operates as a two-stage pipeline:
- Stage One (PP-DocLayoutV2): This stage performs page-level layout analysis using an RT-DETR detector to localize and classify regions, while a pointer network predicts the reading order.
- Stage Two (PaddleOCR-VL-0.9B): This stage performs element-level recognition on the regions detected in stage one; the recognized elements are then aggregated into the final output for downstream applications (a minimal sketch of this two-stage flow appears below).
This design effectively mitigates the long-sequence decoding latency and instability often encountered by end-to-end vision-language models, particularly on dense, multi-column pages.
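The following sketch illustrates only the control flow of that two-stage design: detect and order regions first, then run the lightweight recognizer per region and aggregate. The functions detect_layout and recognize_element are hypothetical stand-ins, not the real PaddleOCR-VL API.

```python
# Illustrative two-stage flow with placeholder models (not the actual PaddleOCR-VL API).

def detect_layout(page_image):
    # Stand-in for stage one (PP-DocLayoutV2 in the real system):
    # returns regions with a class label, bounding box, and reading-order index.
    return [
        {"kind": "text",  "bbox": [40, 30, 560, 80],  "order": 0},
        {"kind": "table", "bbox": [40, 100, 560, 300], "order": 1},
    ]

def recognize_element(page_image, region):
    # Stand-in for stage two (PaddleOCR-VL-0.9B in the real system):
    # would crop region["bbox"] and decode it into Markdown.
    return f"<{region['kind']} content>"

def parse_page(page_image):
    # Follow the predicted reading order, recognize each region, aggregate the page.
    regions = sorted(detect_layout(page_image), key=lambda r: r["order"])
    return "\n\n".join(recognize_element(page_image, r) for r in regions)

print(parse_page(page_image=None))
```

Keeping layout analysis and reading-order prediction out of the decoder is what limits sequence length per decoding call, which is why the design avoids the long-sequence latency issues noted above.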
Technical Specifications
PaddleOCR-VL-0.9B employs a NaViT-style dynamic high-resolution encoder, which uses native-resolution sequence packing along with a 2-layer MLP projector. The model also incorporates 3D-RoPE for positional representation. According to the technical report, this native-resolution processing reduces hallucinations and improves performance on text-dense documents compared to fixed-resize or tiling approaches.
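For readers unfamiliar with the projector component, here is a minimal PyTorch sketch of a 2-layer MLP projector of the kind the report mentions, mapping packed vision-encoder tokens into the language model's embedding space. The hidden sizes and activation are assumptions chosen for illustration, not the actual PaddleOCR-VL configuration.

```python
# Minimal sketch of a 2-layer MLP projector; dimensions and activation are assumptions.
import torch
import torch.nn as nn

class Projector(nn.Module):
    def __init__(self, vision_dim=1024, lm_dim=1024):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, lm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(lm_dim, lm_dim)

    def forward(self, vision_tokens):
        # vision_tokens: (num_packed_tokens, vision_dim), i.e. a variable-length
        # packed sequence from the native-resolution encoder.
        return self.fc2(self.act(self.fc1(vision_tokens)))

tokens = torch.randn(1536, 1024)       # packed tokens from one batch of pages
print(Projector()(tokens).shape)       # -> torch.Size([1536, 1024])
```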
Performance Benchmarks
PaddleOCR-VL has demonstrated state-of-the-art results on OmniDocBench v1.5 and competitive scores on v1.0, excelling in sub-tasks such as:
- Text edit distance
- Formula recognition (CDM)
- Table recognition (TEDS / TEDS-S)
- Reading-order edit distance
It also shows complementary strengths on olmOCR-Bench and in internal evaluations covering handwriting, tables, formulas, and charts. The sketch below illustrates the normalized edit-distance metric behind the text and reading-order scores above.
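This is a generic, self-contained Levenshtein implementation shown for intuition only (lower is better); it is not the benchmark's official evaluation code.

```python
# Illustrative normalized edit distance (Levenshtein), as commonly reported
# in document-parsing benchmarks; not OmniDocBench's official scorer.

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    denom = max(len(pred), len(ref)) or 1
    return edit_distance(pred, ref) / denom

print(normalized_edit_distance("PaddleOCR-VL", "PaddleOCR VL"))  # ~0.083
```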
Key Takeaways
- 0.9B-parameter PaddleOCR-VL integrates a NaViT-style dynamic-resolution encoder with ERNIE-4.5-0.3B for comprehensive document parsing.
- Targets end-to-end extraction across various content types with structured Markdown/JSON outputs.
- Claims state-of-the-art performance on public document benchmarks with rapid inference suitable for real-world applications.
- Supports 109 languages, including complex layouts and small scripts.
This release marks a notable advance in document parsing technology, combining efficiency with high accuracy and broad language support. For further details, refer to the Technical Paper, explore the Model on Hugging Face, and visit the GitHub Page for tutorials, code, and notebooks.