SOTAVerified

Document Understanding

Document understanding spans document classification, layout analysis, information extraction, and document question answering (DocQA).
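As a minimal illustration of the information-extraction subtask listed above, here is a toy sketch that pulls key-value fields out of OCR'd invoice text with regular expressions. The field names, patterns, and sample text are all hypothetical; the papers below learn layout-aware extractors rather than hand-written rules.

```python
import re

# Hypothetical OCR output from a scanned invoice (illustrative only).
ocr_text = """
Invoice No: INV-2024-0042
Date: 2024-03-15
Total: $1,249.00
"""

# Hand-written patterns for three assumed fields; real document-understanding
# models infer such fields from pixels and layout instead of fixed regexes.
patterns = {
    "invoice_no": r"Invoice No:\s*(\S+)",
    "date": r"Date:\s*([\d-]+)",
    "total": r"Total:\s*\$([\d,.]+)",
}

def extract_fields(text: str) -> dict:
    """Return each field's first regex match, or None when absent."""
    return {
        name: (m.group(1) if (m := re.search(pat, text)) else None)
        for name, pat in patterns.items()
    }

fields = extract_fields(ocr_text)
print(fields)
# {'invoice_no': 'INV-2024-0042', 'date': '2024-03-15', 'total': '1,249.00'}
```

The same key-value framing generalizes to DocQA, where the "pattern" is replaced by a natural-language question answered by a multimodal model.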

Papers

Showing 1–50 of 309 papers

Title | Status | Hype
----- | ------ | ----
Qwen2.5-VL Technical Report | Code | 11
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception | Code | 9
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Code | 7
ColPali: Efficient Document Retrieval with Vision Language Models | Code | 7
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid | Code | 5
Focus Anywhere for Fine-grained Multi-page Document Understanding | Code | 5
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Code | 5
LLMMapReduce: Simplified Long-Sequence Processing using Large Language Models | Code | 4
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Code | 3
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | Code | 3
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning | Code | 3
Unifying Vision, Text, and Layout for Universal Document Processing | Code | 3
OCR-free Document Understanding Transformer | Code | 3
AIN: The Arabic INclusive Large Multimodal Model | Code | 2
Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction | Code | 2
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Code | 2
One missing piece in Vision and Language: A Survey on Comics Understanding | Code | 2
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Code | 2
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Code | 2
Visually Guided Generative Text-Layout Pre-training for Document Intelligence | Code | 2
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions | Code | 2
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | Code | 2
LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | Code | 2
ICDAR 2021 Competition on Scientific Literature Parsing | Code | 2
SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement | Code | 1
LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World | Code | 1
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | Code | 1
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding | Code | 1
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Code | 1
Ocean-OCR: Towards General OCR Application via a Vision-Language Model | Code | 1
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding | Code | 1
Docopilot: Improving Multimodal Models for Document-Level Understanding | Code | 1
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating | Code | 1
Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models | Code | 1
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Code | 1
Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding | Code | 1
DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding | Code | 1
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding | Code | 1
DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents | Code | 1
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning | Code | 1
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding | Code | 1
On the Affinity, Rationality, and Diversity of Hierarchical Topic Modeling | Code | 1
WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data | Code | 1
Privacy-Aware Document Visual Question Answering | Code | 1
Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs | Code | 1
DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading | Code | 1
Enhancing Visually-Rich Document Understanding via Layout Structure Modeling | Code | 1
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents | Code | 1
DocFormerv2: Local Features for Document Understanding | Code | 1
PaLI-X: On Scaling up a Multilingual Vision and Language Model | Code | 1
Page 1 of 7

No leaderboard results yet.