| HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models | Dec 11, 2024 | TextVQA | Unverified | 0 |
| Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy | Nov 23, 2024 | Instruction Following, MME | Unverified | 0 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Aug 29, 2024 | MM-Vet, MVBench | Code Available | 9 |
| EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model | Aug 21, 2024 | Computational Efficiency, Language Modeling | Unverified | 0 |
| FlexAttention for Efficient High-Resolution Vision-Language Models | Jul 29, 2024 | TextVQA | Unverified | 0 |
| DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | Jun 6, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models | Jun 3, 2024 | Image Captioning, Language Modelling | Code Available | 2 |
| OmniFusion Technical Report | Apr 9, 2024 | MM-Vet, TextVQA | Code Available | 0 |
| LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images | Mar 18, 2024 | Long-Context Understanding, TextVQA | Code Available | 3 |
| Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering | Mar 14, 2024 | Optical Character Recognition, Optical Character Recognition (OCR) | Code Available | 0 |