SOTAVerified

Visual Question Answering (VQA)

Visual Question Answering (VQA) is a task at the intersection of computer vision and natural language processing: given an image and a free-form question about it, a system must understand the image's content and produce an answer in natural language.

[Example image omitted; image source: visualqa.org]
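
The task can be run end to end with off-the-shelf vision-language models. Below is a minimal inference sketch, assuming the Hugging Face transformers "visual-question-answering" pipeline and the publicly available ViLT checkpoint dandelin/vilt-b32-finetuned-vqa; the model choice, image URL, and question are illustrative only and are not tied to any entry on this page.

```python
# Minimal VQA inference sketch (assumption: transformers'
# "visual-question-answering" pipeline with a ViLT checkpoint).
import requests
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# Illustrative COCO image; any local or remote image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

answers = vqa(image=image, question="How many cats are in the picture?")
print(answers[0])  # e.g. {'score': 0.9..., 'answer': '2'}
```

The papers listed below generally replace such a baseline with larger multimodal language models, but the input/output contract (image + question in, natural-language answer out) is the same.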

Papers

Showing 1–50 of 2167 papers

Title | Status | Hype
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Code | 11
Qwen2.5-VL Technical Report | Code | 11
SWIFT: A Scalable lightWeight Infrastructure for Fine-Tuning | Code | 11
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | Code | 7
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data | Code | 7
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Code | 7
GPT-4 Technical Report | Code | 6
Improved Baselines with Visual Instruction Tuning | Code | 6
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Code | 5
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | Code | 5
Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Code | 5
CogAgent: A Visual Language Model for GUI Agents | Code | 5
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Code | 5
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | Code | 5
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | Code | 5
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | Code | 5
CogVLM: Visual Expert for Pretrained Language Models | Code | 5
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | Code | 5
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | Code | 4
Otter: A Multi-Modal Model with In-Context Instruction Tuning | Code | 4
GLIPv2: Unifying Localization and Vision-Language Understanding | Code | 4
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Code | 4
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Code | 4
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM | Code | 4
Exploring the Capabilities of Large Multimodal Models on Dense Text | Code | 4
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | Code | 4
Flamingo: a Visual Language Model for Few-Shot Learning | Code | 4
Multi-label Cluster Discrimination for Visual Representation Learning | Code | 4
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Code | 4
Tarsier: Recipes for Training and Evaluating Large Video Description Models | Code | 4
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Code | 4
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | Code | 4
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | Code | 4
InternVideo: General Video Foundation Models via Generative and Discriminative Learning | Code | 4
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Code | 4
Long Context Transfer from Language to Vision | Code | 4
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale | Code | 3
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | Code | 3
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent | Code | 3
All You May Need for VQA are Image Captions | Code | 3
Emu: Generative Pretraining in Multimodality | Code | 3
Evaluating Text-to-Visual Generation with Image-to-Text Generation | Code | 3
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models | Code | 3
MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making | Code | 3
MMSearch-R1: Incentivizing LMMs to Search | Code | 3
DriveLM: Driving with Graph Visual Question Answering | Code | 3
Ludwig: a type-based declarative deep learning toolbox | Code | 3
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | Code | 3
OCR-free Document Understanding Transformer | Code | 3

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | human | Accuracy | 89.3 | – | Unverified
2 | DREAM+Unicoder-VL (MSRA) | Accuracy | 76.04 | – | Unverified
3 | TRRNet (Ensemble) | Accuracy | 74.03 | – | Unverified
4 | MIL-nbgao | Accuracy | 73.81 | – | Unverified
5 | Kakao Brain | Accuracy | 73.33 | – | Unverified
6 | Coarse-to-Fine Reasoning, Single Model | Accuracy | 72.14 | – | Unverified
7 | 270 | Accuracy | 70.23 | – | Unverified
8 | NSM ensemble (updated) | Accuracy | 67.55 | – | Unverified
9 | VinVL-DPT | Accuracy | 64.92 | – | Unverified
10 | VinVL+L | Accuracy | 64.85 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLI | Accuracy | 84.3 | – | Unverified
2 | BEiT-3 | Accuracy | 84.19 | – | Unverified
3 | VLMo | Accuracy | 82.78 | – | Unverified
4 | ONE-PEACE | Accuracy | 82.6 | – | Unverified
5 | mPLUG (Huge) | Accuracy | 82.43 | – | Unverified
6 | CuMo-7B | Accuracy | 82.2 | – | Unverified
7 | X2-VLM (large) | Accuracy | 81.9 | – | Unverified
8 | MMU | Accuracy | 81.26 | – | Unverified
9 | Lyrics | Accuracy | 81.2 | – | Unverified
10 | InternVL-C | Accuracy | 81.2 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | overall | 84.03 | – | Unverified
2 | mPLUG-Huge | overall | 83.62 | – | Unverified
3 | ONE-PEACE | overall | 82.52 | – | Unverified
4 | X2-VLM (large) | overall | 81.8 | – | Unverified
5 | VLMo | overall | 81.3 | – | Unverified
6 | SimVLM | overall | 80.34 | – | Unverified
7 | X2-VLM (base) | overall | 80.2 | – | Unverified
8 | VAST | overall | 80.19 | – | Unverified
9 | VALOR | overall | 78.62 | – | Unverified
10 | Prompt Tuning | overall | 78.53 | – | Unverified
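
For context on the "Accuracy" and "overall" columns above: VQA-style leaderboards commonly report the consensus accuracy introduced with the VQA dataset, in which a predicted answer is scored against the ten human answers collected per question. The page does not state which exact metric each table uses, so the sketch below only illustrates the standard simplified formula min(#matching human answers / 3, 1); the official evaluation additionally normalizes answers and averages over annotator subsets.

```python
def vqa_consensus_accuracy(prediction, human_answers):
    """Simplified VQA accuracy for one question: full credit if at
    least 3 of the (typically 10) annotators gave the same answer."""
    matches = sum(1 for a in human_answers if a == prediction)
    return min(matches / 3.0, 1.0)

# Worked example with 10 illustrative annotator answers.
humans = ["2", "2", "two", "2", "2", "3", "two", "2 cats", "two", "2"]
print(vqa_consensus_accuracy("2", humans))    # 1.0  (5 matches)
print(vqa_consensus_accuracy("two", humans))  # 1.0  (3 matches)
print(vqa_consensus_accuracy("3", humans))    # 0.33 (1 match)
```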