| Analysing the Robustness of Vision-Language-Models to Common Corruptions | Apr 18, 2025 | TextVQA | Unverified | 0 |
| Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models | Mar 24, 2025 | MME, TextVQA | Code Available | 0 |
| InstructOCR: Instruction Boosting Scene Text Spotting | Dec 20, 2024 | Optical Character Recognition (OCR), Text Spotting | Code Available | 0 |
| Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues | Dec 17, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models | Dec 11, 2024 | TextVQA | Unverified | 0 |
| Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy | Nov 23, 2024 | Instruction Following, MME | Unverified | 0 |
| EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model | Aug 21, 2024 | Computational Efficiency, Language Modeling | Unverified | 0 |
| FlexAttention for Efficient High-Resolution Vision-Language Models | Jul 29, 2024 | TextVQA | Unverified | 0 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | Jun 6, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| OmniFusion Technical Report | Apr 9, 2024 | MM-Vet, TextVQA | Code Available | 0 |