| Qwen2.5-VL Technical Report | Feb 19, 2025 | document understanding | Code Available | 11 |
| DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception | Oct 16, 2024 | Document Layout Analysis, document understanding | Code Available | 9 |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Jul 1, 2025 | document understanding, Multimodal Reasoning | Code Available | 7 |
| ColPali: Efficient Document Retrieval with Vision Language Models | Jun 27, 2024 | document understanding, RAG | Code Available | 7 |
| Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid | Aug 4, 2024 | document understanding | Code Available | 5 |
| Focus Anywhere for Fine-grained Multi-page Document Understanding | May 23, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available | 5 |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Mar 7, 2024 | document understanding, Key Information Extraction | Code Available | 5 |
| LLMMapReduce: Simplified Long-Sequence Processing using Large Language Models | Oct 12, 2024 | document understanding | Code Available | 4 |
| MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Mar 18, 2025 | document understanding, Question Answering | Code Available | 3 |
| Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | Sep 19, 2024 | document understanding, Video Question Answering | Code Available | 3 |
| INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning | Jan 12, 2024 | Diversity, document understanding | Code Available | 3 |
| Unifying Vision, Text, and Layout for Universal Document Processing | Dec 5, 2022 | Document AI, document understanding | Code Available | 3 |
| OCR-free Document Understanding Transformer | Nov 30, 2021 | Document Image Classification, document understanding | Code Available | 3 |
| AIN: The Arabic INclusive Large Multimodal Model | Jan 31, 2025 | document understanding, model | Code Available | 2 |
| Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction | Nov 19, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available | 2 |
| PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Oct 8, 2024 | document understanding, Language Modeling | Code Available | 2 |
| One missing piece in Vision and Language: A Survey on Comics Understanding | Sep 14, 2024 | document understanding, image-classification | Code Available | 2 |
| A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Jul 2, 2024 | document understanding, Key Information Extraction | Code Available | 2 |
| MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Jul 1, 2024 | Benchmarking, document understanding | Code Available | 2 |
| Visually Guided Generative Text-Layout Pre-training for Document Intelligence | Mar 25, 2024 | Document Classification, document understanding | Code Available | 2 |
| InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions | Jan 24, 2024 | document understanding, Question Answering | Code Available | 2 |
| Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | Jun 1, 2022 | CPU, document understanding | Code Available | 2 |
| LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | Feb 28, 2022 | Document Image Classification, document understanding | Code Available | 2 |
| ICDAR 2021 Competition on Scientific Literature Parsing | Jun 8, 2021 | document understanding, object-detection | Code Available | 2 |
| SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement | Jun 16, 2025 | document understanding, Question Answering | Code Available | 1 |
| LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World | Jun 1, 2025 | document understanding, Entity Linking | Code Available | 1 |
| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | May 22, 2025 | document understanding, Multimodal Reasoning | Code Available | 1 |
| Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding | May 8, 2025 | document understanding, Instruction Following | Code Available | 1 |
| FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Apr 24, 2025 | document understanding, MME | Code Available | 1 |
| Ocean-OCR: Towards General OCR Application via a Vision-Language Model | Jan 26, 2025 | document understanding, Language Modeling | Code Available | 1 |
| DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding | Jan 1, 2025 | document understanding, Optical Character Recognition (OCR) | Code Available | 1 |
| Docopilot: Improving Multimodal Models for Document-Level Understanding | Jan 1, 2025 | document understanding, RAG | Code Available | 1 |
| LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating | Dec 24, 2024 | document understanding, Question Answering | Code Available | 1 |
| Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models | Dec 18, 2024 | document understanding, Image Captioning | Code Available | 1 |
| CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Oct 24, 2024 | document understanding, Video Understanding | Code Available | 1 |
| Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding | Sep 29, 2024 | document understanding, Entity Linking | Code Available | 1 |
| DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding | Aug 27, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available | 1 |
| VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding | Jul 17, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available | 1 |
| DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents | Jul 12, 2024 | Document Layout Analysis, document understanding | Code Available | 1 |
| Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning | Jun 4, 2024 | document understanding, GPU | Code Available | 1 |
| Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding | Feb 28, 2024 | document understanding, Information Retrieval | Code Available | 1 |
| On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling | Jan 25, 2024 | Decoder, Diversity | Code Available | 1 |
| WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data | Dec 15, 2023 | document understanding, Question Answering | Code Available | 1 |
| Privacy-Aware Document Visual Question Answering | Dec 15, 2023 | document understanding, Federated Learning | Code Available | 1 |
| Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs | Nov 22, 2023 | document understanding, Instruction Following | Code Available | 1 |
| DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading | Oct 23, 2023 | Document AI, document understanding | Code Available | 1 |
| Enhancing Visually-Rich Document Understanding via Layout Structure Modeling | Aug 15, 2023 | document understanding | Code Available | 1 |
| DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents | Jun 9, 2023 | Contrastive Learning, document understanding | Code Available | 1 |
| DocFormerv2: Local Features for Document Understanding | Jun 2, 2023 | Decoder, document understanding | Code Available | 1 |
| PaLI-X: On Scaling up a Multilingual Vision and Language Model | May 29, 2023 | Chart Question Answering, document understanding | Code Available | 1 |