What Happened
RedNote has released a multimodal optical character recognition (OCR) model called Multimodal OCR. According to an announcement highlighted by the X account @HuggingPapers, the model has 3 billion parameters and is designed to parse complex documents containing mixed content.
What the Model Does
The core function of the model is to convert visual document elements into structured, machine-readable formats. Specifically, it can process:
- Text
- Charts
- Diagrams
- Tables
The output is not plain text but structured formats suitable for further processing or web display, including:
- Markdown
- HTML
- SVG (Scalable Vector Graphics)
- LaTeX
This suggests the model goes beyond traditional text extraction to understand the semantic and structural roles of different elements on a page, reconstructing them in a format that preserves their original intent.
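To make "structured, machine-readable formats" concrete, here is a toy sketch of the difference between a flat text dump and structured output for a simple table. This is purely illustrative: it is not the model's actual output or API, and the table contents are invented for the example.

```python
# Illustrative only: shows the kind of structured output (Markdown, HTML)
# a document-understanding OCR model might emit for a table, instead of a
# flat run of text. The table data below is invented for demonstration.

rows = [["Model", "Parameters"], ["Multimodal OCR", "3B"]]

def to_markdown(rows):
    """Render rows as a Markdown table; the first row is the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

def to_html(rows):
    """Render the same rows as an HTML table."""
    header, *body = rows
    cells = lambda r, tag: "".join(f"<{tag}>{c}</{tag}>" for c in r)
    parts = ["<table>", "<tr>" + cells(header, "th") + "</tr>"]
    parts += ["<tr>" + cells(r, "td") + "</tr>" for r in body]
    parts.append("</table>")
    return "".join(parts)

print(to_markdown(rows))
print(to_html(rows))
```

The point of formats like these is that downstream tools (renderers, parsers, converters) can recover the table's rows and columns, which a plain-text transcription would lose.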
Reported Performance
The announcement makes a specific performance claim: the model ranks second only to Gemini 3 Pro on OCR benchmarks. Google's Gemini 3 Pro is a significantly larger, state-of-the-art multimodal model. This positioning implies that RedNote's specialized 3B model is competitive with or surpasses other generalist and specialist models in document understanding tasks, though the specific benchmarks and scores were not detailed in the source.
Context & Availability
The model appears to be publicly released. The source includes a link (https://t.co/055ezUJ1A6), which likely points to the model page on a platform like Hugging Face or a dedicated release announcement, where further technical details, usage examples, and access information would be available.