| Qwen2.5-VL Technical Report | Feb 19, 2025 | document understanding | Code Available | 11 |
| DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception | Oct 16, 2024 | Document Layout Analysis, document understanding | Code Available | 9 |
| ColPali: Efficient Document Retrieval with Vision Language Models | Jun 27, 2024 | document understanding, RAG | Code Available | 7 |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Jul 1, 2025 | document understanding, Multimodal Reasoning | Code Available | 7 |
| Focus Anywhere for Fine-grained Multi-page Document Understanding | May 23, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available | 5 |
| Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid | Aug 4, 2024 | document understanding | Code Available | 5 |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Mar 7, 2024 | document understanding, Key Information Extraction | Code Available | 5 |
| LLMMapReduce: Simplified Long-Sequence Processing using Large Language Models | Oct 12, 2024 | document understanding | Code Available | 4 |
| MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 3 |
| Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | Sep 19, 2024 | document understanding, Video Question Answering | Code Available | 3 |
| Unifying Vision, Text, and Layout for Universal Document Processing | Dec 5, 2022 | Document AI, document understanding | Code Available | 3 |
| INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning | Jan 12, 2024 | Diversity, document understanding | Code Available | 3 |
| OCR-free Document Understanding Transformer | Nov 30, 2021 | Document Image Classification, document understanding | Code Available | 3 |
| MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Jul 1, 2024 | Benchmarking, document understanding | Code Available | 2 |
| LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | Feb 28, 2022 | Document Image Classification, document understanding | Code Available | 2 |
| One missing piece in Vision and Language: A Survey on Comics Understanding | Sep 14, 2024 | document understanding, image-classification | Code Available | 2 |
| Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction | Nov 19, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available | 2 |
| A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Jul 2, 2024 | document understanding, Key Information Extraction | Code Available | 2 |
| Visually Guided Generative Text-Layout Pre-training for Document Intelligence | Mar 25, 2024 | Document Classification, document understanding | Code Available | 2 |
| AIN: The Arabic INclusive Large Multimodal Model | Jan 31, 2025 | document understanding, model | Code Available | 2 |
| PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Oct 8, 2024 | document understanding, Language Modeling | Code Available | 2 |
| Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | Jun 1, 2022 | CPU, document understanding | Code Available | 2 |
| ICDAR 2021 Competition on Scientific Literature Parsing | Jun 8, 2021 | document understanding, object-detection | Code Available | 2 |
| InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions | Jan 24, 2024 | document understanding, Question Answering | Code Available | 2 |
| MedICaT: A Dataset of Medical Images, Captions, and Textual References | Oct 12, 2020 | document understanding, Image-text matching | Code Available | 1 |
| LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating | Dec 24, 2024 | document understanding, Question Answering | Code Available | 1 |
| Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding | May 8, 2025 | document understanding, Instruction Following | Code Available | 1 |
| M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis | Jan 1, 2023 | Articles, Document Layout Analysis | Code Available | 1 |
| Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding | Sep 29, 2024 | document understanding, Entity Linking | Code Available | 1 |
| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | May 22, 2025 | document understanding, Multimodal Reasoning | Code Available | 1 |
| LineFormer: Rethinking Line Chart Data Extraction as Instance Segmentation | May 3, 2023 | Data Visualization, document understanding | Code Available | 1 |
| LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World | Jun 1, 2025 | document understanding, Entity Linking | Code Available | 1 |
| CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding | May 23, 2021 | document understanding, Domain Adaptation | Code Available | 1 |
| Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning | Jun 4, 2024 | document understanding, GPU | Code Available | 1 |
| Multimodal Pre-training Based on Graph Attention Network for Document Understanding | Mar 25, 2022 | document understanding, Graph Attention | Code Available | 1 |
| FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Apr 24, 2025 | document understanding, MME | Code Available | 1 |
| ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Oct 12, 2022 | document-image-classification, Document Image Classification | Code Available | 1 |
| Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | Feb 18, 2021 | Decoder, Document Image Classification | Code Available | 1 |
| Document Understanding Dataset and Evaluation (DUDE) | May 15, 2023 | Document AI, document understanding | Code Available | 1 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understanding, Video Understanding | Code Available | 1 |
| DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents | Jul 12, 2024 | Document Layout Analysis, document understanding | Code Available | 1 |
| DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents | Jun 9, 2023 | Contrastive Learning, document understanding | Code Available | 1 |
| CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data | Apr 28, 2023 | document understanding, Language Modeling | Code Available | 1 |
| Enhancing Visually-Rich Document Understanding via Layout Structure Modeling | Aug 15, 2023 | document understanding | Code Available | 1 |
| End-to-end Document Recognition and Understanding with Dessurt | Mar 30, 2022 | document understanding, Visual Question Answering (VQA) | Code Available | 1 |
| Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding | Feb 28, 2024 | document understanding, Information Retrieval | Code Available | 1 |
| DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding | Jan 1, 2025 | document understanding, Optical Character Recognition (OCR) | Code Available | 1 |
| Docopilot: Improving Multimodal Models for Document-Level Understanding | Jan 1, 2025 | document understanding, RAG | Code Available | 1 |
| DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding | Aug 27, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available | 1 |
| DocQueryNet: Value Retrieval with Arbitrary Queries for Form-like Documents | Oct 1, 2022 | document understanding, Form | Code Available | 1 |