| ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Feb 27, 2024 | 3D geometry, 3D Object Captioning | Code Available | 3 |
| Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models | Feb 16, 2024 | Diversity, Instruction Following | Code Available | 1 |
| DIEM: Decomposition-Integration Enhancing Multimodal Insights | Jan 1, 2024 | MM-Vet, Question Answering | — Unverified | 0 |
| CogAgent: A Visual Language Model for GUI Agents | Dec 14, 2023 | Language Modeling | Code Available | 5 |
| Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? | Nov 29, 2023 | In-Context Learning, Instruction Following | Code Available | 1 |
| Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision | Nov 13, 2023 | Hallucination, MM-Vet | Code Available | 1 |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Nov 13, 2023 | Instruction Following, MM-Vet | Code Available | 2 |
| Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model | Oct 31, 2023 | Autonomous Driving, Language Modeling | — Unverified | 0 |
| MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | Aug 4, 2023 | Math, MM-Vet | Code Available | 2 |