| Mitigating Object Hallucinations via Sentence-Level Early Intervention | Jul 16, 2025 | Hallucination, MM-Vet | Code Available | 1 |
| TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance | May 29, 2025 | Image Super-Resolution, Optical Character Recognition | Unverified | 0 |
| EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models | May 28, 2025 | Mixture-of-Experts, MME | Unverified | 0 |
| Analysing the Robustness of Vision-Language-Models to Common Corruptions | Apr 18, 2025 | TextVQA | Unverified | 0 |
| Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models | Mar 24, 2025 | MME, TextVQA | Code Available | 0 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Jan 14, 2025 | Image Classification | Code Available | 2 |
| What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph | Jan 4, 2025 | TextVQA | Code Available | 2 |
| InstructOCR: Instruction Boosting Scene Text Spotting | Dec 20, 2024 | Optical Character Recognition (OCR), Text Spotting | Code Available | 0 |
| Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues | Dec 17, 2024 | Language Modeling | Code Available | 0 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | Code Available | 3 |