| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Jun 1, 2023 | Image ClassificationInstruction Following | CodeCode Available | 4 |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | May 5, 2023 | GPUIn-Context Learning | CodeCode Available | 4 |
| mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Feb 1, 2023 | Action ClassificationImage Classification | CodeCode Available | 4 |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Jan 30, 2023 | Generative Visual Question AnsweringImage Captioning | CodeCode Available | 4 |
| Flamingo: a Visual Language Model for Few-Shot Learning | Apr 29, 2022 | Few-Shot LearningGenerative Visual Question Answering | CodeCode Available | 4 |
| InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts | May 25, 2025 | Chart UnderstandingQuestion Answering | CodeCode Available | 3 |
| SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment | Mar 12, 2025 | Autonomous DrivingBench2Drive | CodeCode Available | 3 |
| MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs | Feb 24, 2025 | Question AnsweringVisual Question Answering | CodeCode Available | 3 |
| SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation | Feb 18, 2025 | Object RearrangementRobot Manipulation | CodeCode Available | 3 |
| VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | Jan 21, 2025 | Image GenerationInstruction Following | CodeCode Available | 3 |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Dec 12, 2024 | EgoSchema | CodeCode Available | 3 |
| VisionZip: Longer is Better but Not Necessary in Vision Language Models | Dec 5, 2024 | Video UnderstandingVisual Question Answering | CodeCode Available | 3 |
| Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Nov 5, 2024 | BenchmarkingHallucination | CodeCode Available | 3 |
| Baichuan-Omni Technical Report | Oct 11, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 3 |
| Emu3: Next-Token Prediction is All You Need | Sep 27, 2024 | All | CodeCode Available | 3 |
| MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | Aug 1, 2024 | MathMM-Vet | CodeCode Available | 3 |
| TokenPacker: Efficient Visual Projector for Multimodal LLM | Jul 2, 2024 | Language ModellingLarge Language Model | CodeCode Available | 3 |
| Efficient Multimodal Large Language Models: A Survey | May 17, 2024 | Edge-computingQuestion Answering | CodeCode Available | 3 |
| View Selection for 3D Captioning via Diffusion Ranking | Apr 11, 2024 | 3D Object CaptioningHallucination | CodeCode Available | 3 |
| Evaluating Text-to-Visual Generation with Image-to-Text Generation | Apr 1, 2024 | Image to textQuestion Answering | CodeCode Available | 3 |
| M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models | Mar 31, 2024 | Image-text RetrievalLanguage Modeling | CodeCode Available | 3 |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | Mar 12, 2024 | AllMixture-of-Experts | CodeCode Available | 3 |
| Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models | Mar 10, 2024 | Visual Question Answering | CodeCode Available | 3 |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | Mar 8, 2024 | 1 Image, 2*2 StitchingCode Generation | CodeCode Available | 3 |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | Mar 5, 2024 | TextVQAVisual Question Answering | CodeCode Available | 3 |