| CogVLM2: Visual Language Models for Image and Video Understanding | Aug 29, 2024 | MM-Vet; MVBench | Code Available | 9 |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Mar 7, 2024 | document understanding; Key Information Extraction | Code Available | 5 |
| CogVLM: Visual Expert for Pretrained Language Models | Nov 6, 2023 | 1 Image, 2*2 Stitching; FS-MEVQA | Code Available | 5 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | Code Available | 3 |
| LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Mar 18, 2024 | Long-Context Understanding; TextVQA | Code Available | 3 |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | Mar 5, 2024 | TextVQA; Visual Question Answering | Code Available | 3 |
| Towards VQA Models That Can Read | Apr 18, 2019 | TextVQA; Visual Question Answering (VQA) | Code Available | 3 |
| Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Jan 14, 2025 | image-classification; Image Classification | Code Available | 2 |
| What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph | Jan 4, 2025 | TextVQA | Code Available | 2 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image Captioning; Language Modelling | Code Available | 2 |