| A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends | Jul 14, 2025 | document understanding, Optical Character Recognition | Unverified | 0 |
| PaddleOCR 3.0 Technical Report | Jul 8, 2025 | document understanding, Key Information Extraction | Code Available | 0 |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Jul 1, 2025 | document understanding, Multimodal Reasoning | Code Available | 7 |
| Class-Agnostic Region-of-Interest Matching in Document Images | Jun 26, 2025 | Document Layout Analysis, document understanding | Code Available | 0 |
| DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Jun 26, 2025 | document understanding, Optical Character Recognition (OCR) | Code Available | 0 |
| Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models | Jun 25, 2025 | document understanding, Hallucination | Unverified | 0 |
| PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding | Jun 22, 2025 | document understanding | Code Available | 0 |
| WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | Jun 18, 2025 | document understanding, Multiple-choice | Unverified | 0 |
| SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement | Jun 16, 2025 | document understanding, Question Answering | Code Available | 1 |
| A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions | Jun 5, 2025 | Computational Efficiency, document understanding | Unverified | 0 |
| DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning | Jun 5, 2025 | document understanding, Event Detection | Unverified | 0 |
| Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing | Jun 1, 2025 | Document AI, document understanding | Code Available | 0 |
| LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World | Jun 1, 2025 | document understanding, Entity Linking | Code Available | 1 |
| Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning | May 26, 2025 | document understanding, Multimodal Reasoning | Unverified | 0 |
| MT^3: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning | May 26, 2025 | document understanding, Machine Translation | Unverified | 0 |
| Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning | May 24, 2025 | document understanding, Visual Reasoning | Unverified | 0 |
| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | May 22, 2025 | document understanding, Multimodal Reasoning | Code Available | 1 |
| The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting | May 19, 2025 | document understanding, Optical Character Recognition (OCR) | Unverified | 0 |
| WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild? | May 16, 2025 | document understanding | Unverified | 0 |
| Document Image Rectification Bases on Self-Adaptive Multitask Fusion | May 9, 2025 | document understanding | Unverified | 0 |
| Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding | May 8, 2025 | document understanding, Instruction Following | Code Available | 1 |
| Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer | May 2, 2025 | document understanding, Hallucination | Unverified | 0 |
| FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Apr 24, 2025 | document understanding, MME | Code Available | 1 |
| Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models | Apr 16, 2025 | document understanding, Layout Design | Code Available | 0 |
| Relation-Rich Visual Document Generator for Visual Information Extraction | Apr 14, 2025 | Diversity, document understanding | Code Available | 0 |
| NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding | Apr 12, 2025 | Benchmarking, Document AI | Unverified | 0 |
| QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding | Apr 3, 2025 | document understanding, Language Modeling | Unverified | 0 |
| How does Watermarking Affect Visual Language Models in Document Understanding? | Apr 1, 2025 | document understanding | Unverified | 0 |
| Improving Applicability of Deep Learning based Token Classification models during Training | Mar 28, 2025 | document understanding, token-classification | Unverified | 0 |
| M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? | Mar 27, 2025 | Document Summarization, document understanding | Code Available | 0 |
| BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction | Mar 25, 2025 | document understanding, object-detection | Code Available | 0 |
| SFDLA: Source-Free Document Layout Analysis | Mar 24, 2025 | Avg, Document Layout Analysis | Code Available | 0 |
| A Simple yet Effective Layout Token in Large Language Models for Document Understanding | Mar 24, 2025 | document understanding, Position | Unverified | 0 |
| MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 3 |
| Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 0 |
| PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks | Mar 6, 2025 | document understanding, Language Modeling | Code Available | 0 |
| Zero-Shot Complex Question-Answering on Long Scientific Documents | Mar 4, 2025 | Answer Generation, document understanding | Code Available | 0 |
| A Token-level Text Image Foundation Model for Document Understanding | Mar 4, 2025 | document understanding, Visual Question Answering (VQA) | Unverified | 0 |
| Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI | Feb 24, 2025 | document understanding, Multimodal Reasoning | Unverified | 0 |
| OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models | Feb 22, 2025 | document understanding, Key Information Extraction | Code Available | 0 |
| KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding | Feb 20, 2025 | document understanding, Optical Character Recognition | Unverified | 0 |
| Qwen2.5-VL Technical Report | Feb 19, 2025 | document understanding | Code Available | 11 |
| Assessing Generative AI value in a public sector context: evidence from a field experiment | Feb 13, 2025 | document understanding | Unverified | 0 |
| DocMIA: Document-Level Membership Inference Attacks against DocVQA Models | Feb 6, 2025 | document understanding, Inference Attack | Code Available | 0 |
| AIN: The Arabic INclusive Large Multimodal Model | Jan 31, 2025 | document understanding, model | Code Available | 2 |
| Ocean-OCR: Towards General OCR Application via a Vision-Language Model | Jan 26, 2025 | document understanding, Language Modeling | Code Available | 1 |
| HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja | Jan 21, 2025 | document understanding, Machine Translation | Code Available | 0 |
| BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations | Jan 6, 2025 | Document AI, document understanding | Unverified | 0 |
| Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends | Jan 4, 2025 | document understanding, Question Answering | Unverified | 0 |
| Docopilot: Improving Multimodal Models for Document-Level Understanding | Jan 1, 2025 | document understanding, RAG | Code Available | 1 |