| Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | Jan 29, 2025 | Image Generation | Code Available | 11 |
| JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | Nov 12, 2024 | Language Modeling | Code Available | 11 |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | Oct 17, 2024 | Visual Question Answering | Code Available | 11 |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Sep 18, 2024 | Natural Language Visual Grounding | Code Available | 11 |
| SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning | Aug 10, 2024 | Hallucination, Optical Character Recognition | Code Available | 11 |
| RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness | May 27, 2024 | Hallucination, Image Captioning | Code Available | 11 |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Dec 13, 2024 | Chart Understanding, Mixture-of-Experts | Code Available | 9 |
| CogVLM2: Visual Language Models for Image and Video Understanding | Aug 29, 2024 | MM-Vet, MVBench | Code Available | 9 |
| Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models | Apr 11, 2024 | Language Modeling | Code Available | 9 |
| LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | Nov 15, 2024 | Logical Reasoning, Multimodal Reasoning | Code Available | 7 |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Aug 9, 2024 | Language Modeling | Code Available | 7 |
| Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining | Aug 5, 2024 | Decoder, Depth Estimation | Code Available | 7 |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | May 16, 2024 | Image Captioning, Image Generation | Code Available | 7 |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | Mar 27, 2024 | Image Classification, Image Comprehension | Code Available | 7 |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | Mar 8, 2024 | Chatbot, Language Modeling | Code Available | 7 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Feb 8, 2024 | Benchmarking, Diversity | Code Available | 7 |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | Jan 29, 2024 | Hallucination, Mixture-of-Experts | Code Available | 7 |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | Oct 14, 2023 | Image Classification, Image Description | Code Available | 7 |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Apr 20, 2023 | Image Description, Language Modeling | Code Available | 7 |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Dec 1, 2023 | Hallucination, Image Captioning | Code Available | 6 |
| Improved Baselines with Visual Instruction Tuning | Oct 5, 2023 | Factual Inconsistency Detection in Chart Captioning, Image Classification | Code Available | 6 |
| An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models | Sep 18, 2023 | Visual Question Answering | Code Available | 6 |
| Visual Instruction Tuning | Apr 17, 2023 | 1 Image, 2*2 Stitching; 3D Question Answering (3D-QA) | Code Available | 6 |
| GPT-4 Technical Report | Mar 15, 2023 | answerability prediction, Arithmetic Reasoning | Code Available | 6 |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | Aug 22, 2024 | 10-shot image generation | Code Available | 5 |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Jun 12, 2024 | Image Generation, Language Modeling | Code Available | 5 |
| Wings: Learning Multimodal LLMs without Text-only Forgetting | Jun 5, 2024 | Question Answering, Visual Question Answering | Code Available | 5 |
| Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | May 18, 2024 | Mixture-of-Experts, Visual Question Answering | Code Available | 5 |
| CogAgent: A Visual Language Model for GUI Agents | Dec 14, 2023 | Language Modeling | Code Available | 5 |
| CogVLM: Visual Expert for Pretrained Language Models | Nov 6, 2023 | 1 Image, 2*2 Stitching; FS-MEVQA | Code Available | 5 |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Aug 24, 2023 | Chart Question Answering, FS-MEVQA | Code Available | 5 |
| MMBench: Is Your Multi-modal Model an All-around Player? | Jul 12, 2023 | Instruction Following | Code Available | 5 |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Apr 28, 2023 | Instruction Following | Code Available | 5 |
| Scaling Up Biomedical Vision-Language Models: Fine-Tuning, Instruction Tuning, and Multi-Modal Learning | May 23, 2025 | Decoder, Image Captioning | Code Available | 4 |
| OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | Mar 30, 2025 | Autonomous Driving, Decision Making | Code Available | 4 |
| A Survey on Vision-Language-Action Models for Embodied AI | May 23, 2024 | Image Captioning, Instruction Following | Code Available | 4 |
| OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | May 2, 2024 | Autonomous Driving, counterfactual | Code Available | 4 |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | Feb 29, 2024 | Hallucination | Code Available | 4 |
| TinyLLaVA: A Framework of Small-scale Large Multimodal Models | Feb 22, 2024 | Visual Question Answering | Code Available | 4 |
| OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM | Feb 14, 2024 | Medical Visual Question Answering, Question Answering | Code Available | 4 |
| Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | Feb 12, 2024 | Hallucination, Object Localization | Code Available | 4 |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Feb 5, 2024 | Science Question Answering, Text-to-Video Generation | Code Available | 4 |
| GPT-4V(ision) is a Generalist Web Agent, if Grounded | Jan 3, 2024 | Image Captioning, Question Answering | Code Available | 4 |
| VILA: On Pre-training for Visual Language Models | Dec 12, 2023 | In-Context Learning, Language Modeling | Code Available | 4 |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Nov 16, 2023 | Language Modeling | Code Available | 4 |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Nov 13, 2023 | Described Object Detection, Language Modeling | Code Available | 4 |
| OtterHD: A High-Resolution Multi-modality Model | Nov 7, 2023 | Visual Question Answering | Code Available | 4 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Nov 7, 2023 | 1 Image, 2*2 Stitching; Decoder | Code Available | 4 |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Aug 2, 2023 | Visual Question Answering (VQA) | Code Available | 4 |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Jun 8, 2023 | In-Context Learning, Visual Question Answering | Code Available | 4 |