| CogVLM2: Visual Language Models for Image and Video Understanding | Aug 29, 2024 | MM-Vet; MVBench | Code Available | 9 | 5 |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Mar 7, 2024 | document understanding; Key Information Extraction | Code Available | 5 | 5 |
| CogVLM: Visual Expert for Pretrained Language Models | Nov 6, 2023 | 1 Image, 2*2 Stitching; FS-MEVQA | Code Available | 5 | 5 |
| Towards VQA Models That Can Read | Apr 18, 2019 | TextVQA; Visual Question Answering (VQA) | Code Available | 3 | 5 |
| LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Mar 18, 2024 | Long-Context Understanding; TextVQA | Code Available | 3 | 5 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | Code Available | 3 | 5 |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | Mar 5, 2024 | TextVQA; Visual Question Answering | Code Available | 3 | 5 |
| What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph | Jan 4, 2025 | TextVQA | Code Available | 2 | 5 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Jan 14, 2025 | image-classification; Image Classification | Code Available | 2 | 5 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image Captioning; Language Modelling | Code Available | 2 | 5 |
| TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation | Aug 3, 2022 | Answer Generation; Question-Answer-Generation | Code Available | 1 | 5 |
| LaTr: Layout-Aware Transformer for Scene-Text VQA | Dec 23, 2021 | Optical Character Recognition (OCR); Question Answering | Code Available | 1 | 5 |
| RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering | Oct 24, 2020 | Optical Character Recognition; Optical Character Recognition (OCR) | Code Available | 1 | 5 |
| Mitigating Object Hallucinations via Sentence-Level Early Intervention | Jul 16, 2025 | Hallucination; MM-Vet | Code Available | 1 | 5 |
| A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations | Apr 29, 2021 | TextVQA | Code Available | 1 | 5 |
| Spatially Aware Multimodal Transformers for TextVQA | Jul 23, 2020 | Optical Character Recognition (OCR); Spatial Reasoning | Code Available | 1 | 5 |
| Structured Multimodal Attentions for TextVQA | Jun 1, 2020 | Graph Attention; Optical Character Recognition (OCR) | Code Available | 1 | 5 |
| TAP: Text-Aware Pre-training for Text-VQA and Text-Caption | Dec 8, 2020 | Caption Generation; Language Modeling | Code Available | 1 | 5 |
| Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Mar 14, 2024 | Optical Character Recognition; Optical Character Recognition (OCR) | Code Available | 0 | 5 |
| OmniFusion Technical Report | Apr 9, 2024 | MM-Vet; TextVQA | Code Available | 0 | 5 |
| Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models | Mar 24, 2025 | MME; TextVQA | Code Available | 0 | 5 |
| Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA | Nov 14, 2019 | General Classification; TextVQA | Code Available | 0 | 5 |
| Towards a Unified Multimodal Reasoning Framework | Dec 22, 2023 | Multimodal Reasoning; Multiple-choice | Code Available | 0 | 5 |
| Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps | Dec 9, 2020 | Decoder; Image Captioning | Code Available | 0 | 5 |
| Separate and Locate: Rethink the Text in Text-based Visual Question Answering | Aug 31, 2023 | Optical Character Recognition (OCR); Position | Code Available | 0 | 5 |