SOTAVerified

Visual Question Answering (VQA)

Visual Question Answering (VQA) is a computer vision task in which a system answers natural-language questions about an image. The goal is to build models that understand visual content well enough to produce correct, free-form answers to arbitrary questions about it.
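
As a concrete illustration, below is a minimal sketch of VQA inference using the Hugging Face transformers library. The checkpoint (dandelin/vilt-b32-finetuned-vqa) and the image URL are illustrative assumptions, not something this page prescribes:

```python
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import requests

# Illustrative inputs: a COCO validation image and a free-form question.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# ViLT fine-tuned on VQAv2 treats VQA as classification over a fixed
# vocabulary of frequent answers.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)  # e.g. "2"
```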


Papers

Showing 51–75 of 2167 papers

Title | Status | Hype
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review | Code | 3
PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers | Code | 3
Common Sense Reasoning for Deepfake Detection | Code | 3
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones | Code | 3
DriveLM: Driving with Graph Visual Question Answering | Code | 3
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | Code | 3
Emu: Generative Pretraining in Multimodality | Code | 3
CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning | Code | 3
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | Code | 3
Champion Solution for the WSDM2023 Toloka VQA Challenge | Code | 3
Unifying Vision, Text, and Layout for Universal Document Processing | Code | 3
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3
All You May Need for VQA are Image Captions | Code | 3
OCR-free Document Understanding Transformer | Code | 3
Ludwig: a type-based declarative deep learning toolbox | Code | 3
Towards VQA Models That Can Read | Code | 3
Pythia v0.1: the Winning Entry to the VQA Challenge 2018 | Code | 3
Bilinear Attention Networks | Code | 3
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models | Code | 2
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis | Code | 2
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding | Code | 2
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories | Code | 2
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning | Code | 2
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model | Code | 2
Page 3 of 87

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | human | Accuracy | 89.3 | | Unverified
2 | DREAM+Unicoder-VL (MSRA) | Accuracy | 76.04 | | Unverified
3 | TRRNet (Ensemble) | Accuracy | 74.03 | | Unverified
4 | MIL-nbgao | Accuracy | 73.81 | | Unverified
5 | Kakao Brain | Accuracy | 73.33 | | Unverified
6 | Coarse-to-Fine Reasoning, Single Model | Accuracy | 72.14 | | Unverified
7 | 270 | Accuracy | 70.23 | | Unverified
8 | NSM ensemble (updated) | Accuracy | 67.55 | | Unverified
9 | VinVL-DPT | Accuracy | 64.92 | | Unverified
10 | VinVL+L | Accuracy | 64.85 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLI | Accuracy | 84.3 | | Unverified
2 | BEiT-3 | Accuracy | 84.19 | | Unverified
3 | VLMo | Accuracy | 82.78 | | Unverified
4 | ONE-PEACE | Accuracy | 82.6 | | Unverified
5 | mPLUG (Huge) | Accuracy | 82.43 | | Unverified
6 | CuMo-7B | Accuracy | 82.2 | | Unverified
7 | X2-VLM (large) | Accuracy | 81.9 | | Unverified
8 | MMU | Accuracy | 81.26 | | Unverified
9 | Lyrics | Accuracy | 81.2 | | Unverified
10 | InternVL-C | Accuracy | 81.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | overall | 84.03 | | Unverified
2 | mPLUG-Huge | overall | 83.62 | | Unverified
3 | ONE-PEACE | overall | 82.52 | | Unverified
4 | X2-VLM (large) | overall | 81.8 | | Unverified
5 | VLMo | overall | 81.3 | | Unverified
6 | SimVLM | overall | 80.34 | | Unverified
7 | X2-VLM (base) | overall | 80.2 | | Unverified
8 | VAST | overall | 80.19 | | Unverified
9 | VALOR | overall | 78.62 | | Unverified
10 | Prompt Tuning | overall | 78.53 | | Unverified
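
The scores above are on a 0–100 scale. VQA-style leaderboards typically report the consensus accuracy introduced with the original VQA dataset, where a predicted answer counts as fully correct if at least three of the ten human annotators gave it. A minimal sketch, assuming this page uses that standard metric (simplified: it omits the official evaluator's answer normalization and leave-one-annotator-out averaging):

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus VQA accuracy: min(#annotators who gave this answer / 3, 1)."""
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Hypothetical example with 10 annotator answers:
answers = ["2", "2", "two", "2", "3", "2", "3", "two", "3", "3"]
print(vqa_accuracy("2", answers))    # 4 matches -> 1.0 (fully correct)
print(vqa_accuracy("two", answers))  # 2 matches -> 0.667 (partial credit)
```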