Apps

Baidu Unlimited-OCR: One-Shot Long-Horizon Document Parsing

Published on June 25, 2026 by Wasim

What Is Unlimited-OCR?

Baidu Unlimited-OCR GitHub repository — Unlimited OCR Works: Welcome the Era of One-shot Long-horizon Parsing

Unlimited-OCR is an open-source document-parsing model from Baidu that pushes optical character recognition well past the single-image era. Its tagline sets the ambition plainly: "Unlimited OCR Works: Welcome the Era of One-shot Long-horizon Parsing."

Where most OCR tools read one page at a time and lose the thread on long or complex documents, Unlimited-OCR is built to parse lengthy, multi-page documents in one shot — preserving structure across pages, columns, tables, and forms. It builds directly on DeepSeek-OCR and draws on methodologies from Baidu's own PaddleOCR, packaging them into a vision-language model you can run yourself.

At the time of writing it has over 7,000 GitHub stars, is released under the MIT license, and ships with a companion arXiv paper.

The Problem It Solves

Traditional OCR pipelines were designed for short, clean inputs: a receipt, a single scanned page, a screenshot. Feed them a 40-page report with multi-column layouts, nested tables, and footnotes, and they fall apart — text gets reordered, tables collapse, and context that spans pages is lost entirely.

"Long-horizon parsing" is the answer to that. Instead of treating each page as an isolated image, Unlimited-OCR processes extended documents while keeping layout and reading order coherent across the whole thing. That's the difference between extracting characters and actually understanding a document.

Architecture and Model

Under the hood, Unlimited-OCR takes a vision-language model (VLM) approach rather than the classic detect-then-recognize OCR pipeline. A few notable design points:

It's based on a transformer architecture and extends DeepSeek-OCR for long-horizon parsing.
It uses custom logit processors during generation to prevent the repetitive-output loops that long-document VLMs are prone to — a practical fix for a very real failure mode when decoding thousands of tokens.
It offers two configuration modes — gundam and base — letting you trade off resolution/quality against speed and memory.

The model weights are published on the Hugging Face Hub (and mirrored on ModelScope), so you can pull them directly into your own inference stack.

Key Features

One-shot long-horizon parsing for extended, structurally complex documents.
Single-image processing with the gundam or base configuration modes.
Multi-page and PDF support for batch document parsing.
Streaming inference through an OpenAI-compatible API.
Batch processing with concurrent request handling.
Repetition control via custom logit processors for stable long outputs.

The Tech Stack

Unlimited-OCR is a Python project that sits on top of the mainstream ML serving ecosystem:

Core framework: PyTorch with Hugging Face Transformers.
High-performance serving: SGLang for streaming, OpenAI-compatible inference.
Document handling: PyMuPDF for converting PDFs into page images, plus torchvision and einops.
Requirements: Python 3.12.3+ and a recent CUDA toolchain (CUDA 12.9).

Installation and Usage

The repository provides two inference paths, depending on whether you want a quick test or a production server.

1. Transformers API

Load the model directly from Hugging Face and run inference in-process — the simplest way to parse a single image or a handful of pages:

from transformers import AutoModel

model = AutoModel.from_pretrained("baidu/Unlimited-OCR", trust_remote_code=True)
# parse a single image with a configuration mode (gundam or base)
result = model.parse("document.png", mode="gundam")

2. SGLang Server

For throughput and streaming, serve the model with SGLang and hit it through the OpenAI-compatible endpoint. This path is the one to use for batch PDF processing and concurrent requests.

Across both paths you can tune parameters like max_length, image_size, and ngram_window to balance accuracy, speed, and the aggressiveness of repetition suppression. PDFs are automatically converted to images before parsing, so you can throw whole files at it without manual preprocessing.

Supported Formats and Use Cases

Unlimited-OCR handles individual images (JPG, PNG), multi-page documents, and PDF files (auto-converted to images). That makes it a fit for:

Document digitization and archival — turning scanned back-catalogs into structured text.
Complex layout parsing — forms, tables, and multi-column text that break conventional OCR.
Batch PDF processing at enterprise scale.
Enterprise document management where reading order and structure matter.

The Research Behind It

Unlimited-OCR isn't just a code drop — it's backed by the paper "Unlimited OCR Works", published on arXiv in June 2026 by Youyang Yin, Huanhuan Liu, and colleagues at Baidu. If you want to understand the long-horizon parsing approach in depth — rather than just running the model — the paper is the primary source.

Final Thoughts

Unlimited-OCR is a clear signal of where document AI is heading: away from brittle, page-at-a-time pipelines and toward vision-language models that read a whole document the way a person would. The fact that Baidu open-sourced it under MIT, published the weights on Hugging Face, and shipped a paper alongside makes it unusually easy to evaluate and adopt. If you work with long, messy, real-world documents, it's well worth a test run.

Explore the project on GitHub or grab the weights from Hugging Face.

OpenMontage: Turn Your AI Coding Assistant Into a Video Studio — another open-source, AI-native production project.
Paca: The AI-Native, Open-Source Alternative to Jira — open-source AI tooling for teams.
Why I Built My Own Tool Library — on owning your own tooling.
Browse all ToolShed tools — free, browser-based developer utilities.