| Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues | Dec 17, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization | Feb 12, 2024 | In-Context Learning, TextVQA | Code Available | 0 | 5 |
| InstructOCR: Instruction Boosting Scene Text Spotting | Dec 20, 2024 | Optical Character Recognition (OCR), Text Spotting | Code Available | 0 | 5 |
| Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model | Jun 24, 2021 | Decoder, Language Modeling | —Unverified | 0 | 0 |
| Analysing the Robustness of Vision-Language-Models to Common Corruptions | Apr 18, 2025 | TextVQA | —Unverified | 0 | 0 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | Jun 6, 2024 | Language Modelling, Large Language Model | —Unverified | 0 | 0 |
| EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model | Aug 21, 2024 | Computational Efficiency, Language Modeling | —Unverified | 0 | 0 |
| Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy | Nov 23, 2024 | Instruction Following, MME | —Unverified | 0 | 0 |
| EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models | May 28, 2025 | Mixture-of-Experts, MME | —Unverified | 0 | 0 |
| Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA | Oct 13, 2023 | Graph Learning, Object | —Unverified | 0 | 0 |
| FlexAttention for Efficient High-Resolution Vision-Language Models | Jul 29, 2024 | TextVQA | —Unverified | 0 | 0 |
| Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture | Nov 11, 2021 | Graph Attention, Question Answering | —Unverified | 0 | 0 |
| HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models | Dec 11, 2024 | TextVQA | —Unverified | 0 | 0 |
| Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA | Apr 4, 2023 | Answer Generation, Language Modelling | —Unverified | 0 | 0 |
| Making the V in Text-VQA Matter | Aug 1, 2023 | Optical Character Recognition (OCR), TextVQA | —Unverified | 0 | 0 |
| Multiple-Question Multiple-Answer Text-VQA | Nov 15, 2023 | Decoder, Denoising | —Unverified | 0 | 0 |
| SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering | Dec 16, 2022 | Optical Character Recognition, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| Sentence Attention Blocks for Answer Grounding | Sep 20, 2023 | Question Answering, Sentence | —Unverified | 0 | 0 |
| TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text | May 12, 2021 | Optical Character Recognition, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance | May 29, 2025 | Image Super-Resolution, Optical Character Recognition | —Unverified | 0 | 0 |
| Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering | Sep 21, 2022 | Image Captioning, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |
| Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering | Mar 24, 2022 | Optical Character Recognition, Optical Character Recognition (OCR) | —Unverified | 0 | 0 |