SOTAVerified

document understanding

Document understanding covers document classification, layout analysis, information extraction, and document question answering (DocQA).

Papers

Showing 1–50 of 309 papers

Title | Status | Hype
Qwen2.5-VL Technical Report | Code | 11
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception | Code | 9
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Code | 7
ColPali: Efficient Document Retrieval with Vision Language Models | Code | 7
Focus Anywhere for Fine-grained Multi-page Document Understanding | Code | 5
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Code | 5
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid | Code | 5
LLMMapReduce: Simplified Long-Sequence Processing using Large Language Models | Code | 4
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding | Code | 3
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | Code | 3
OCR-free Document Understanding Transformer | Code | 3
Unifying Vision, Text, and Layout for Universal Document Processing | Code | 3
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning | Code | 3
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Code | 2
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | Code | 2
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Code | 2
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions | Code | 2
AIN: The Arabic INclusive Large Multimodal Model | Code | 2
Visually Guided Generative Text-Layout Pre-training for Document Intelligence | Code | 2
Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction | Code | 2
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling | Code | 2
ICDAR 2021 Competition on Scientific Literature Parsing | Code | 2
One missing piece in Vision and Language: A Survey on Comics Understanding | Code | 2
LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | Code | 2
MedICaT: A Dataset of Medical Images, Captions, and Textual References | Code | 1
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | Code | 1
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding | Code | 1
M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis | Code | 1
Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding | Code | 1
End-to-end Document Recognition and Understanding with Dessurt | Code | 1
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | Code | 1
Enhancing Visually-Rich Document Understanding via Layout Structure Modeling | Code | 1
CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding | Code | 1
LineFormer: Rethinking Line Chart Data Extraction as Instance Segmentation | Code | 1
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating | Code | 1
Multimodal Pre-training Based on Graph Attention Network for Document Understanding | Code | 1
DocQueryNet: Value Retrieval with Arbitrary Queries for Form-like Documents | Code | 1
CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data | Code | 1
DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading | Code | 1
DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding | Code | 1
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark | Code | 1
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding | Code | 1
DocFormer: End-to-End Transformer for Document Understanding | Code | 1
Docopilot: Improving Multimodal Models for Document-Level Understanding | Code | 1
DocFormerv2: Local Features for Document Understanding | Code | 1
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents | Code | 1
LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World | Code | 1
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Code | 1
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | Code | 1
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding | Code | 1

No leaderboard results yet.