SOTAVerified

Document Understanding

Document understanding covers document classification, layout analysis, information extraction, and document question answering (DocQA).

Papers

Showing 1-25 of 309 papers

| Title | Status | Hype |
| --- | --- | --- |
| A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends | | 0 |
| PaddleOCR 3.0 Technical Report | Code | 0 |
| GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Code | 7 |
| Class-Agnostic Region-of-Interest Matching in Document Images | Code | 0 |
| DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Code | 0 |
| Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models | | 0 |
| PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding | Code | 0 |
| WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | | 0 |
| SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement | Code | 1 |
| A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions | | 0 |
| DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning | | 0 |
| Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing | Code | 0 |
| LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World | Code | 1 |
| Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning | | 0 |
| MT^3: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning | | 0 |
| Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning | | 0 |
| ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark | Code | 1 |
| The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting | | 0 |
| WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild? | | 0 |
| Document Image Rectification Bases on Self-Adaptive Multitask Fusion | | 0 |
| Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding | Code | 1 |
| Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer | | 0 |
| FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding | Code | 1 |
| Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models | Code | 0 |
| Relation-Rich Visual Document Generator for Visual Information Extraction | Code | 0 |

No leaderboard results yet.