| Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | Jan 29, 2025 | Image Generation | Code Available | 11 |
| JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | Nov 12, 2024 | Language Modeling | Code Available | 11 |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | Oct 17, 2024 | Visual Question Answering | Code Available | 11 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Sep 18, 2024 | Natural Language Visual Grounding | Code Available | 11 |
| SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning | Aug 10, 2024 | Hallucination, Optical Character Recognition | Code Available | 11 |
| RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness | May 27, 2024 | Hallucination, Image Captioning | Code Available | 11 |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Dec 13, 2024 | Chart Understanding, Mixture-of-Experts | Code Available | 9 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Aug 29, 2024 | MM-Vet, MVBench | Code Available | 9 |
| Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models | Apr 11, 2024 | Language Modeling | Code Available | 9 |
| LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | Nov 15, 2024 | Logical Reasoning, Multimodal Reasoning | Code Available | 7 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Aug 9, 2024 | Language Modeling | Code Available | 7 |
| Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining | Aug 5, 2024 | Decoder, Depth Estimation | Code Available | 7 |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | May 16, 2024 | Image Captioning, Image Generation | Code Available | 7 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Mar 27, 2024 | Image Classification, Image Comprehension | Code Available | 7 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | Mar 8, 2024 | Chatbot, Language Modeling | Code Available | 7 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Feb 8, 2024 | Benchmarking, Diversity | Code Available | 7 |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | Jan 29, 2024 | Hallucination, Mixture-of-Experts | Code Available | 7 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification, Image Description | Code Available | 7 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description, Language Modeling | Code Available | 7 |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Dec 1, 2023 | Hallucination, Image Captioning | Code Available | 6 |
| Improved Baselines with Visual Instruction Tuning | Oct 5, 2023 | Factual Inconsistency Detection in Chart Captioning, Image Classification | Code Available | 6 |
| An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models | Sep 18, 2023 | Visual Question Answering | Code Available | 6 |
| Visual Instruction Tuning | Apr 17, 2023 | 1 Image, 2*2 Stitching; 3D Question Answering (3D-QA) | Code Available | 6 |
| GPT-4 Technical Report | Mar 15, 2023 | answerability prediction, Arithmetic Reasoning | Code Available | 6 |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | Aug 22, 2024 | 10-shot image generation | Code Available | 5 |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Jun 12, 2024 | Image Generation, Language Modeling | Code Available | 5 |
| Wings: Learning Multimodal LLMs without Text-only Forgetting | Jun 5, 2024 | Question Answering, Visual Question Answering | Code Available | 5 |
| Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | May 18, 2024 | Mixture-of-Experts, Visual Question Answering | Code Available | 5 |
| CogAgent: A Visual Language Model for GUI Agents | Dec 14, 2023 | Language Modeling | Code Available | 5 |
| CogVLM: Visual Expert for Pretrained Language Models | Nov 6, 2023 | 1 Image, 2*2 Stitching; FS-MEVQA | Code Available | 5 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 |
| MMBench: Is Your Multi-modal Model an All-around Player? | Jul 12, 2023 | Instruction Following | Code Available | 5 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Apr 28, 2023 | Instruction Following | Code Available | 5 |
| Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning | May 23, 2025 | Decoder, Image Captioning | Code Available | 4 |
| OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | Mar 30, 2025 | Autonomous Driving, Decision Making | Code Available | 4 |
| A Survey on Vision-Language-Action Models for Embodied AI | May 23, 2024 | Image Captioning, Instruction Following | Code Available | 4 |
| OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | May 2, 2024 | Autonomous Driving, counterfactual | Code Available | 4 |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | Feb 29, 2024 | Hallucination | Code Available | 4 |
| TinyLLaVA: A Framework of Small-scale Large Multimodal Models | Feb 22, 2024 | Visual Question Answering | Code Available | 4 |
| OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM | Feb 14, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 4 |
| Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | Feb 12, 2024 | Hallucination, Object Localization | Code Available | 4 |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Feb 5, 2024 | Science Question Answering, Text-to-Video Generation | Code Available | 4 |
| GPT-4V(ision) is a Generalist Web Agent, if Grounded | Jan 3, 2024 | Image Captioning, Question Answering | Code Available | 4 |
| VILA: On Pre-training for Visual Language Models | Dec 12, 2023 | In-Context Learning, Language Modeling | Code Available | 4 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Nov 16, 2023 | Language Modeling | Code Available | 4 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Nov 13, 2023 | Described Object Detection, Language Modeling | Code Available | 4 |
| OtterHD: A High-Resolution Multi-modality Model | Nov 7, 2023 | Visual Question Answering | Code Available | 4 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Nov 7, 2023 | 1 Image, 2*2 Stitching; Decoder | Code Available | 4 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Aug 2, 2023 | Visual Question Answering (VQA) | Code Available | 4 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Jun 8, 2023 | In-Context Learning, Visual Question Answering | Code Available | 4 |