document understanding

Document understanding covers document classification, layout analysis, information extraction, and document question answering (DocQA).

Papers

Showing 1–50 of 309 papers

Title	Date	Tasks	Status	Hype
Qwen2.5-VL Technical Report	Feb 19, 2025	document understanding	Code Available	11
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception	Oct 16, 2024	Document Layout Analysis, document understanding	Code Available	9
ColPali: Efficient Document Retrieval with Vision Language Models	Jun 27, 2024	document understanding, RAG	Code Available	7
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning	Jul 1, 2025	document understanding, Multimodal Reasoning	Code Available	7
Focus Anywhere for Fine-grained Multi-page Document Understanding	May 23, 2024	document understanding, Optical Character Recognition (OCR)	Code Available	5
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid	Aug 4, 2024	document understanding	Code Available	5
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document	Mar 7, 2024	document understanding, Key Information Extraction	Code Available	5
LLMMapReduce: Simplified Long-Sequence Processing using Large Language Models	Oct 12, 2024	document understanding	Code Available	4
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding	Mar 18, 2025	document understanding, Question Answering	Code Available	3
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution	Sep 19, 2024	document understanding, Video Question Answering	Code Available	3
Unifying Vision, Text, and Layout for Universal Document Processing	Dec 5, 2022	Document AI, document understanding	Code Available	3
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning	Jan 12, 2024	Diversity, document understanding	Code Available	3
OCR-free Document Understanding Transformer	Nov 30, 2021	Document Image Classification, document understanding	Code Available	3
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations	Jul 1, 2024	Benchmarking, document understanding	Code Available	2
LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding	Feb 28, 2022	Document Image Classification, document understanding	Code Available	2
One missing piece in Vision and Language: A Survey on Comics Understanding	Sep 14, 2024	document understanding, image-classification	Code Available	2
Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction	Nov 19, 2024	document understanding, Optical Character Recognition (OCR)	Code Available	2
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding	Jul 2, 2024	document understanding, Key Information Extraction	Code Available	2
Visually Guided Generative Text-Layout Pre-training for Document Intelligence	Mar 25, 2024	Document Classification, document understanding	Code Available	2
AIN: The Arabic INclusive Large Multimodal Model	Jan 31, 2025	document understanding, model	Code Available	2
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling	Oct 8, 2024	document understanding, Language Modeling	Code Available	2
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness	Jun 1, 2022	CPU, document understanding	Code Available	2
ICDAR 2021 Competition on Scientific Literature Parsing	Jun 8, 2021	document understanding, object-detection	Code Available	2
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions	Jan 24, 2024	document understanding, Question Answering	Code Available	2
MedICaT: A Dataset of Medical Images, Captions, and Textual References	Oct 12, 2020	document understanding, Image-text matching	Code Available	1
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating	Dec 24, 2024	document understanding, Question Answering	Code Available	1
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding	May 8, 2025	document understanding, Instruction Following	Code Available	1
M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis	Jan 1, 2023	Articles, Document Layout Analysis	Code Available	1
Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding	Sep 29, 2024	document understanding, Entity Linking	Code Available	1
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark	May 22, 2025	document understanding, Multimodal Reasoning	Code Available	1
LineFormer: Rethinking Line Chart Data Extraction as Instance Segmentation	May 3, 2023	Data Visualization, document understanding	Code Available	1
LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World	Jun 1, 2025	document understanding, Entity Linking	Code Available	1
CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding	May 23, 2021	document understanding, Domain Adaptation	Code Available	1
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning	Jun 4, 2024	document understanding, GPU	Code Available	1
Multimodal Pre-training Based on Graph Attention Network for Document Understanding	Mar 25, 2022	document understanding, Graph Attention	Code Available	1
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding	Apr 24, 2025	document understanding, MME	Code Available	1
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding	Oct 12, 2022	document-image-classification, Document Image Classification	Code Available	1
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer	Feb 18, 2021	Decoder, Document Image Classification	Code Available	1
Document Understanding Dataset and Evaluation (DUDE)	May 15, 2023	Document AI, document understanding	Code Available	1
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark	Oct 24, 2024	document understanding, Video Understanding	Code Available	1
DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents	Jul 12, 2024	Document Layout Analysis, document understanding	Code Available	1
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents	Jun 9, 2023	Contrastive Learning, document understanding	Code Available	1
CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data	Apr 28, 2023	document understanding, Language Modeling	Code Available	1
Enhancing Visually-Rich Document Understanding via Layout Structure Modeling	Aug 15, 2023	document understanding	Code Available	1
End-to-end Document Recognition and Understanding with Dessurt	Mar 30, 2022	document understanding, Visual Question Answering (VQA)	Code Available	1
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding	Feb 28, 2024	document understanding, Information Retrieval	Code Available	1
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding	Jan 1, 2025	document understanding, Optical Character Recognition (OCR)	Code Available	1
Docopilot: Improving Multimodal Models for Document-Level Understanding	Jan 1, 2025	document understanding, RAG	Code Available	1
DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding	Aug 27, 2024	document understanding, Optical Character Recognition (OCR)	Code Available	1
DocQueryNet: Value Retrieval with Arbitrary Queries for Form-like Documents	Oct 1, 2022	document understanding, Form	Code Available	1

No leaderboard results yet.