MinerU: An Open-Source Solution for Precise Document Content Extraction | Sep 27, 2024 | Diversity, Optical Character Recognition (OCR) | Code Available 165
MiniCPM-V: A GPT-4V Level MLLM on Your Phone | Aug 3, 2024 | Hallucination, Multiple-choice | Code Available 125
SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning | Aug 10, 2024 | Hallucination, Optical Character Recognition | Code Available 115
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model | Sep 3, 2024 | Decoder, Math | Code Available 95
DeepSeek-VL: Towards Real-World Vision-Language Understanding | Mar 8, 2024 | Chatbot, Language Modelling | Code Available 75
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Feb 8, 2024 | Benchmarking, Diversity | Code Available 75
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Mar 7, 2024 | document understanding, Key Information Extraction | Code Available 55
Kimi-VL Technical Report | Apr 10, 2025 | Long-Context Understanding, Mathematical Reasoning | Code Available 55
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Apr 28, 2023 | Instruction Following, model | Code Available 55
MixTex: Unambiguous Recognition Should Not Rely Solely on Real Data | Jun 24, 2024 | Data Augmentation, Optical Character Recognition (OCR) | Code Available 55
Nougat: Neural Optical Understanding for Academic Documents | Aug 25, 2023 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available 55
Focus Anywhere for Fine-grained Multi-page Document Understanding | May 23, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available 55
On Path to Multimodal Historical Reasoning: HistBench and HistAgent | May 26, 2025 | Optical Character Recognition (OCR) | Code Available 45
AnyText: Multilingual Visual Text Generation And Editing | Nov 6, 2023 | Image Generation, Optical Character Recognition (OCR) | Code Available 45
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark | Sep 4, 2024 | Optical Character Recognition (OCR) | Code Available 45
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition | Jul 21, 2015 | Optical Character Recognition (OCR), Scene Text Recognition | Code Available 45
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Dec 31, 2024 | Benchmarking, Logical Reasoning | Code Available 45
PaliGemma 2: A Family of Versatile VLMs for Transfer | Dec 4, 2024 | Language Modeling, Language Modelling | Code Available 35
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | Aug 1, 2024 | Math, MM-Vet | Code Available 35
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | Dec 11, 2023 | Chart Understanding, Decoder | Code Available 35
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Dec 5, 2024 | Contrastive Learning, Hallucination | Code Available 35
From Panels to Prose: Generating Literary Narratives from Comics | Mar 30, 2025 | Optical Character Recognition (OCR) | Code Available 35
OCR-free Document Understanding Transformer | Nov 30, 2021 | Document Image Classification, document understanding | Code Available 35
Image-to-Markup Generation with Coarse-to-Fine Attention | Sep 16, 2016 | Decoder, Optical Character Recognition (OCR) | Code Available 35
PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System | Sep 7, 2021 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available 25
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | May 13, 2023 | Key Information Extraction, Nutrition | Code Available 25
Real-time Scene Text Detection with Differentiable Binarization | Nov 20, 2019 | Binarization, Optical Character Recognition (OCR) | Code Available 25
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining | Jan 1, 2025 | Optical Character Recognition (OCR) | Code Available 25
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Oct 7, 2022 | Chart Question Answering, Diversity | Code Available 25
PP-OCR: A Practical Ultra Lightweight OCR System | Sep 21, 2020 | Computational Efficiency, Optical Character Recognition | Code Available 25
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | Jun 1, 2022 | CPU, document understanding | Code Available 25
Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction | Nov 19, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available 25
NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement | Apr 8, 2024 | Binarization, Document Enhancement | Code Available 25
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation | Dec 3, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available 25
Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration | Jul 7, 2025 | Optical Character Recognition (OCR) | Code Available 25
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Oct 3, 2023 | Chatbot, Image Captioning | Code Available 25
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Jun 29, 2023 | 16k, Image Captioning | Code Available 25
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Jul 1, 2024 | Benchmarking, document understanding | Code Available 25
IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling | Jan 6, 2023 | Link Prediction, Optical Character Recognition | Code Available 25
Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition | May 23, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available 25
An Empirical Study of Scaling Law for Scene Text Recognition | Jan 1, 2024 | Optical Character Recognition (OCR), Scene Text Recognition | Code Available 25
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Aug 19, 2023 | MME, Optical Character Recognition (OCR) | Code Available 25
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability | Jun 10, 2025 | Optical Character Recognition (OCR) | Code Available 25
An Approach for Air Drawing Using Background Subtraction and Contour Extraction | Mar 3, 2025 | Hand Detection, Optical Character Recognition (OCR) | Code Available 25
GlyphControl: Glyph Conditional Control for Visual Text Generation | May 29, 2023 | Optical Character Recognition (OCR), Text Generation | Code Available 25
MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories | Jun 5, 2025 | Benchmarking, Optical Character Recognition | Code Available 25
GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation | Mar 31, 2023 | Image Generation, Optical Character Recognition (OCR) | Code Available 25
General Detection-based Text Line Recognition | Sep 25, 2024 | HTR, Optical Character Recognition (OCR) | Code Available 25
GIT: A Generative Image-to-text Transformer for Vision and Language | May 27, 2022 | Decoder, Image Captioning | Code Available 25
GUICourse: From General Vision Language Models to Versatile GUI Agents | Jun 17, 2024 | Natural Language Visual Grounding, Optical Character Recognition (OCR) | Code Available 25