
Visual Question Answering (VQA)

Visual Question Answering (VQA) is a task at the intersection of computer vision and natural language processing: given an image and a natural-language question about it, a model must produce an accurate natural-language answer. Doing so requires the machine to jointly understand the visual content of the image and the semantics of the question.

Image Source: visualqa.org
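
Several of the models listed below can be queried off the shelf. As a minimal sketch of what a VQA call looks like in practice (this assumes the Hugging Face transformers library and the public ViLT checkpoint dandelin/vilt-b32-finetuned-vqa; the image path is a placeholder, and none of this is specific to the papers listed here):

```python
# Minimal VQA sketch using the Hugging Face transformers pipeline.
# Assumes: pip install transformers pillow torch
# "path/to/image.jpg" is a placeholder; substitute any local image file.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

answers = vqa(image="path/to/image.jpg",
              question="What is the person doing?")
print(answers)  # list of {"answer": ..., "score": ...} dicts, best first
```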

Papers

Showing 25 of 2167 papers

Title | Status | Hype
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Code | 4
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Code | 4
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Code | 4
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Code | 4
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Code | 4
Otter: A Multi-Modal Model with In-Context Instruction Tuning | Code | 4
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Code | 4
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Code | 4
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Code | 4
GLIPv2: Unifying Localization and Vision-Language Understanding | Code | 4
Flamingo: a Visual Language Model for Few-Shot Learning | Code | 4
MMSearch-R1: Incentivizing LMMs to Search | Code | 3
An Empirical Study on Prompt Compression for Large Language Models | Code | 3
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Code | 3
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Code | 3
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Code | 3
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models | Code | 3
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine | Code | 3
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale | Code | 3
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | Code | 3
MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making | Code | 3
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Code | 3
Evaluating Text-to-Visual Generation with Image-to-Text Generation | Code | 3
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning | Code | 3
Page 2 of 87

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | human | Accuracy | 89.3 | — | Unverified
2 | DREAM+Unicoder-VL (MSRA) | Accuracy | 76.04 | — | Unverified
3 | TRRNet (Ensemble) | Accuracy | 74.03 | — | Unverified
4 | MIL-nbgao | Accuracy | 73.81 | — | Unverified
5 | Kakao Brain | Accuracy | 73.33 | — | Unverified
6 | Coarse-to-Fine Reasoning, Single Model | Accuracy | 72.14 | — | Unverified
7 | 270 | Accuracy | 70.23 | — | Unverified
8 | NSM ensemble (updated) | Accuracy | 67.55 | — | Unverified
9 | VinVL-DPT | Accuracy | 64.92 | — | Unverified
10 | VinVL+L | Accuracy | 64.85 | — | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLI | Accuracy | 84.3 | — | Unverified
2 | BEiT-3 | Accuracy | 84.19 | — | Unverified
3 | VLMo | Accuracy | 82.78 | — | Unverified
4 | ONE-PEACE | Accuracy | 82.6 | — | Unverified
5 | mPLUG (Huge) | Accuracy | 82.43 | — | Unverified
6 | CuMo-7B | Accuracy | 82.2 | — | Unverified
7 | X2-VLM (large) | Accuracy | 81.9 | — | Unverified
8 | MMU | Accuracy | 81.26 | — | Unverified
9 | InternVL-C | Accuracy | 81.2 | — | Unverified
10 | Lyrics | Accuracy | 81.2 | — | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | overall | 84.03 | — | Unverified
2 | mPLUG-Huge | overall | 83.62 | — | Unverified
3 | ONE-PEACE | overall | 82.52 | — | Unverified
4 | X2-VLM (large) | overall | 81.8 | — | Unverified
5 | VLMo | overall | 81.3 | — | Unverified
6 | SimVLM | overall | 80.34 | — | Unverified
7 | X2-VLM (base) | overall | 80.2 | — | Unverified
8 | VAST | overall | 80.19 | — | Unverified
9 | VALOR | overall | 78.62 | — | Unverified
10 | Prompt Tuning | overall | 78.53 | — | Unverified
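
For context on the Accuracy and overall numbers above: VQA-style benchmarks commonly report the soft accuracy introduced with the original VQA dataset (Antol et al., 2015), in which a predicted answer is scored against ten human annotations. The sketch below shows the commonly cited form of that metric; the official evaluation additionally averages over leave-one-annotator-out subsets and normalizes answer strings, so this is illustrative rather than the exact script behind any leaderboard entry here.

```python
# Soft VQA accuracy, commonly cited form: a predicted answer counts as
# fully correct if at least 3 of the 10 human annotators gave it.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("tennis", ["tennis"] * 4 + ["baseball"] * 6))    # -> 1.0
print(vqa_accuracy("baseball", ["tennis"] * 8 + ["baseball"] * 2))  # -> 0.667
```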