MinerU: An Open-Source Solution for Precise Document Content Extraction | Sep 27, 2024 | Diversity, Optical Character Recognition (OCR) | Code Available 16
MiniCPM-V: A GPT-4V Level MLLM on Your Phone | Aug 3, 2024 | Hallucination, Multiple-choice | Code Available 12
SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning | Aug 10, 2024 | Hallucination, Optical Character Recognition | Code Available 11
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model | Sep 3, 2024 | Decoder, Math | Code Available 9
DeepSeek-VL: Towards Real-World Vision-Language Understanding | Mar 8, 2024 | Chatbot, Language Modelling | Code Available 7
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Feb 8, 2024 | Benchmarking, Diversity | Code Available 7
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Mar 7, 2024 | document understanding, Key Information Extraction | Code Available 5
Kimi-VL Technical Report | Apr 10, 2025 | Long-Context Understanding, Mathematical Reasoning | Code Available 5
Nougat: Neural Optical Understanding for Academic Documents | Aug 25, 2023 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available 5
Focus Anywhere for Fine-grained Multi-page Document Understanding | May 23, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available 5
MixTex: Unambiguous Recognition Should Not Rely Solely on Real Data | Jun 24, 2024 | Data Augmentation, Optical Character Recognition (OCR) | Code Available 5
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Apr 28, 2023 | Instruction Following, model | Code Available 5
On Path to Multimodal Historical Reasoning: HistBench and HistAgent | May 26, 2025 | Optical Character Recognition (OCR) | Code Available 4
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Dec 31, 2024 | Benchmarking, Logical Reasoning | Code Available 4
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark | Sep 4, 2024 | Optical Character Recognition (OCR) | Code Available 4
AnyText: Multilingual Visual Text Generation And Editing | Nov 6, 2023 | Image Generation, Optical Character Recognition (OCR) | Code Available 4
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition | Jul 21, 2015 | Optical Character Recognition (OCR), Scene Text Recognition | Code Available 4
PaliGemma 2: A Family of Versatile VLMs for Transfer | Dec 4, 2024 | Language Modeling, Language Modelling | Code Available 3
Image-to-Markup Generation with Coarse-to-Fine Attention | Sep 16, 2016 | Decoder, Optical Character Recognition (OCR) | Code Available 3
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | Aug 1, 2024 | Math, MM-Vet | Code Available 3
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | Dec 11, 2023 | Chart Understanding, Decoder | Code Available 3
From Panels to Prose: Generating Literary Narratives from Comics | Mar 30, 2025 | Optical Character Recognition (OCR) | Code Available 3
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Dec 5, 2024 | Contrastive Learning, Hallucination | Code Available 3
OCR-free Document Understanding Transformer | Nov 30, 2021 | Document Image Classification, document understanding | Code Available 3
PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System | Sep 7, 2021 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available 2
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models | May 13, 2023 | Key Information Extraction, Nutrition | Code Available 2
Real-time Scene Text Detection with Differentiable Binarization | Nov 20, 2019 | Binarization, Optical Character Recognition (OCR) | Code Available 2
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining | Jan 1, 2025 | Optical Character Recognition (OCR) | Code Available 2
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | Oct 7, 2022 | Chart Question Answering, Diversity | Code Available 2
PP-OCR: A Practical Ultra Lightweight OCR System | Sep 21, 2020 | Computational Efficiency, Optical Character Recognition | Code Available 2
NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement | Apr 8, 2024 | Binarization, Document Enhancement | Code Available 2
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness | Jun 1, 2022 | CPU, document understanding | Code Available 2
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation | Dec 3, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available 2
Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration | Jul 7, 2025 | Optical Character Recognition (OCR) | Code Available 2
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Oct 3, 2023 | Chatbot, Image Captioning | Code Available 2
Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition | May 23, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available 2
MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories | Jun 5, 2025 | Benchmarking, Optical Character Recognition | Code Available 2
GUICourse: From General Vision Language Models to Versatile GUI Agents | Jun 17, 2024 | Natural Language Visual Grounding, Optical Character Recognition (OCR) | Code Available 2
IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling | Jan 6, 2023 | Link Prediction, Optical Character Recognition | Code Available 2
GlyphControl: Glyph Conditional Control for Visual Text Generation | May 29, 2023 | Optical Character Recognition (OCR), Text Generation | Code Available 2
GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation | Mar 31, 2023 | Image Generation, Optical Character Recognition (OCR) | Code Available 2
Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction | Nov 19, 2024 | document understanding, Optical Character Recognition (OCR) | Code Available 2
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability | Jun 10, 2025 | Optical Character Recognition (OCR) | Code Available 2
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions | Aug 19, 2023 | MME, Optical Character Recognition (OCR) | Code Available 2
General Detection-based Text Line Recognition | Sep 25, 2024 | HTR, Optical Character Recognition (OCR) | Code Available 2
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Jun 29, 2023 | 16k, Image Captioning | Code Available 2
An Empirical Study of Scaling Law for Scene Text Recognition | Jan 1, 2024 | Optical Character Recognition (OCR), Scene Text Recognition | Code Available 2
An Approach for Air Drawing Using Background Subtraction and Contour Extraction | Mar 3, 2025 | Hand Detection, Optical Character Recognition (OCR) | Code Available 2
MouSi: Poly-Visual-Expert Vision-Language Models | Jan 30, 2024 | Image Segmentation, Image-text matching | Code Available 2
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding | Jul 2, 2024 | document understanding, Key Information Extraction | Code Available 2