| CogVLM2: Visual Language Models for Image and Video Understanding | Aug 29, 2024 | MM-Vet, MVBench | Code Available | 9 |
| CogAgent: A Visual Language Model for GUI Agents | Dec 14, 2023 | Language Modeling | Code Available | 5 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | Code Available | 3 |
| MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | Aug 1, 2024 | Math, MM-Vet | Code Available | 3 |
| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Feb 27, 2024 | 3D geometry, 3D Object Captioning | Code Available | 3 |
| Attention Prompting on Image for Large Vision-Language Models | Sep 25, 2024 | MM-Vet, Visual Prompting | Code Available | 2 |
| Self-Supervised Visual Preference Alignment | Apr 16, 2024 | 8k, MM-Vet | Code Available | 2 |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Nov 13, 2023 | Instruction Following, MM-Vet | Code Available | 2 |
| MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | Aug 4, 2023 | Math, MM-Vet | Code Available | 2 |
| Mitigating Object Hallucinations via Sentence-Level Early Intervention | Jul 16, 2025 | Hallucination, MM-Vet | Code Available | 1 |
| Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models | Feb 16, 2024 | Diversity, Instruction Following | Code Available | 1 |
| Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? | Nov 29, 2023 | In-Context Learning, Instruction Following | Code Available | 1 |
| Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | Nov 13, 2023 | Hallucination, MM-Vet | Code Available | 1 |
| MR. Judge: Multimodal Reasoner as a Judge | May 19, 2025 | MM-Vet, Multiple-choice | Unverified | 0 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | Mar 19, 2025 | MM-Vet, Multimodal Reasoning | Unverified | 0 |
| EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models | Jan 1, 2025 | MM-Vet, Multimodal Reasoning | Unverified | 0 |
| OmniFusion Technical Report | Apr 9, 2024 | MM-Vet, TextVQA | Code Available | 0 |
| DIEM: Decomposition-Integration Enhancing Multimodal Insights | Jan 1, 2024 | MM-Vet, Question Answering | Unverified | 0 |
| Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model | Oct 31, 2023 | Autonomous Driving, Language Modeling | Unverified | 0 |