
Meet dots.ocr: A New 1.7B Vision-Language Model that Achieves SOTA Performance on Multilingual Document Parsing

Understanding the Target Audience

The target audience for dots.ocr includes data scientists, machine learning engineers, and business managers in industries such as finance, legal, and education. Their pain points often involve:

  • Difficulty in extracting structured data from multilingual documents.
  • Challenges in maintaining document layout and structure during data extraction.
  • Need for efficient processing of large volumes of documents across various languages.

Their goals include:

  • Implementing reliable OCR solutions that enhance productivity.
  • Improving accuracy in data extraction tasks.
  • Utilizing open-source tools to reduce costs and increase flexibility.

Interests typically revolve around advancements in AI, machine learning applications in business, and open-source technologies. Communication preferences lean towards technical documentation, case studies, and community-driven forums.

Overview of dots.ocr

dots.ocr is an open-source vision-language transformer model designed for multilingual document layout parsing and optical character recognition (OCR). It performs both layout detection and content recognition within a single architecture, supporting over 100 languages and a variety of structured and unstructured document types.

Architecture

Unified Model: dots.ocr combines layout detection and content recognition into a single transformer-based neural network. This design simplifies the pipeline: instead of chaining separate detection and recognition models, users switch between tasks by changing the input prompt.
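
As a rough sketch of what prompt-driven task switching can look like in practice (the prompt strings and the generate signature below are hypothetical placeholders, not the repository's documented API):

```python
# Hypothetical prompt templates and inference call -- illustrative of the
# single-model, prompt-switched design, not the actual dots.ocr API.
PROMPTS = {
    "parse_all": "Parse the layout and recognize all content on this page.",
    "layout_only": "Detect layout regions and return their bounding boxes.",
    "ocr_only": "Extract the plain text of this page in reading order.",
}

def run_task(model, page_image, task="parse_all"):
    """One network, many tasks: only the prompt changes between calls."""
    return model.generate(image=page_image, prompt=PROMPTS[task])
```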

Parameters: The model comprises 1.7 billion parameters, balancing computational efficiency with performance for practical applications.

Input Flexibility: It accepts image files or PDF documents, with preprocessing options such as fitz_preprocess to improve quality on low-resolution or dense multi-page files.
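
The name fitz_preprocess points at PyMuPDF (imported as fitz). Here is a minimal sketch of the kind of preprocessing such an option implies, rasterizing PDF pages at a higher resolution before inference; the 200 DPI value is our assumption, not the option's documented behavior:

```python
import fitz  # PyMuPDF

def pdf_to_images(path, dpi=200):
    """Rasterize each PDF page to a PNG image at the requested DPI.

    Upscaling low-resolution or dense pages before OCR is the kind of step
    a fitz-based preprocessor performs; the exact parameters used by
    dots.ocr's fitz_preprocess may differ.
    """
    zoom = dpi / 72  # PDF coordinates are in points (1/72 inch)
    matrix = fitz.Matrix(zoom, zoom)
    images = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)
            images.append(pix.tobytes("png"))
    return images
```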

Capabilities

Multilingual: dots.ocr is trained on datasets encompassing over 100 languages, including both major world languages and less common scripts.

Content Extraction: The model extracts plain text, tabular data, and mathematical formulas (in LaTeX), preserving the reading order within documents. Output formats include structured JSON, Markdown, and HTML.
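
To make that concrete, here is an illustrative fragment of what a per-page JSON result might look like. Every field name and category label below is hypothetical; the actual schema is defined in the dots.ocr repository:

```json
{
  "page": 1,
  "regions": [
    {"category": "Title",   "bbox": [72, 64, 540, 96],   "text": "Quarterly Report"},
    {"category": "Table",   "bbox": [72, 120, 540, 360], "text": "<table>...</table>"},
    {"category": "Formula", "bbox": [72, 380, 300, 410], "text": "E = mc^{2}"}
  ]
}
```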

Preserves Structure: dots.ocr maintains document structure, including table boundaries, formula regions, and image placements, ensuring that extracted data remains true to the original document.

Benchmark Performance

dots.ocr has been evaluated against modern document AI systems, with notable results:

  • Table TEDS accuracy: 88.6% (versus 85.8% for Gemini 2.5 Pro)
  • Text edit distance: 0.032, where lower is better (versus 0.055 for Gemini 2.5 Pro)
  • Formulas and Layout: Matches or exceeds leading models in formula recognition and document structure reconstruction.

Deployment and Integration

Open-Source: dots.ocr is released under the MIT license, with source code, documentation, and pre-trained models available on GitHub. The repository provides installation instructions for pip, Conda, and Docker-based deployments.

API and Scripting: The model supports flexible task configuration via prompt templates and can be used interactively or within automated pipelines for batch document processing.
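
A sketch of how such a batch pipeline might be wired together. The infer callable stands in for whatever inference entry point the repository actually exposes (a placeholder, not the real API); it is assumed to take a file path and a prompt and return a JSON-serializable result:

```python
import json
from pathlib import Path

def process_folder(in_dir, out_dir, infer, prompt="parse_all"):
    """Batch-parse every document in in_dir and write one JSON file per input."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc in sorted(Path(in_dir).iterdir()):
        # Only hand supported file types to the model.
        if doc.suffix.lower() not in {".pdf", ".png", ".jpg", ".jpeg"}:
            continue
        result = infer(doc, prompt)
        (out / f"{doc.stem}.json").write_text(
            json.dumps(result, ensure_ascii=False, indent=2)
        )
```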

Output Formats: Extracted results are provided in structured JSON, with options for Markdown and HTML as needed. Visualization scripts enable inspection of detected layouts.
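
A minimal version of such a visualization step can be written with Pillow, assuming regions shaped like the illustrative JSON above (again, that schema is our assumption, not the repository's):

```python
from PIL import Image, ImageDraw

def draw_layout(image_path, regions, out_path):
    """Overlay detected layout regions and their category labels on a page image.

    `regions` is assumed to be a list of dicts with "bbox" ([x0, y0, x1, y1])
    and "category" keys; the real schema is defined by the dots.ocr repository.
    """
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for region in regions:
        x0, y0, x1, y1 = region["bbox"]
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
        draw.text((x0, max(0, y0 - 12)), region["category"], fill="red")
    img.save(out_path)
```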

Conclusion

dots.ocr offers a technical solution for high-accuracy, multilingual document parsing by unifying layout detection and content recognition in a single, open-source model. It is particularly suited for scenarios requiring robust, language-agnostic document analysis and structured information extraction in resource-constrained or production environments.

Check out the GitHub page for tutorials, code, and notebooks.
