Visual Question Answering (VQA)

Visual Question Answering (VQA) is a computer vision task in which a system receives an image and a natural-language question about it, and must return a natural-language answer. The goal is to teach machines to understand image content well enough to answer such questions.
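
To make the task concrete, the sketch below shows one possible shape for a VQA example and a simplified scoring function. The class name, field names, file path, and sample answers are hypothetical. The scoring rule is a simplified form of the consensus accuracy commonly associated with the VQA v2 benchmark, in which a prediction earns full credit when at least three annotators gave the same answer. Other benchmarks listed on this page may score answers differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class VQAExample:
    """One VQA item: an image, a question, and several human answers."""
    image_path: str            # hypothetical path to the image file
    question: str
    human_answers: list[str]   # answers collected from several annotators


def normalize(answer: str) -> str:
    """Lowercase and trim whitespace so trivially different answers match."""
    return answer.strip().lower()


def consensus_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified consensus accuracy: min(matching annotators / 3, 1)."""
    counts = Counter(normalize(a) for a in human_answers)
    return min(counts[normalize(prediction)] / 3.0, 1.0)


# Illustrative data only; not taken from any real dataset.
example = VQAExample(
    image_path="kitchen.jpg",
    question="What color is the kettle?",
    human_answers=["red"] * 7 + ["dark red"] * 2 + ["maroon"],
)

print(consensus_accuracy("Red", example.human_answers))     # full credit
print(consensus_accuracy("maroon", example.human_answers))  # partial credit
print(consensus_accuracy("blue", example.human_answers))    # no credit
```

Partial credit reflects the fact that several answers to the same visual question can be reasonable, so a prediction matching only one annotator is not treated as entirely wrong.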

Papers

Showing 501–525 of 2167 papers

Title	Date	Tasks	Status	Hype
LIVE: Learnable In-Context Vector for Visual Question Answering	Jun 19, 2024	In-Context Learning, Question Answering	Code Available	1
Biomedical Visual Instruction Tuning with Clinician Preference Alignment	Jun 19, 2024	Instruction Following, Visual Question Answering (VQA)	Code Available	0
Diversify, Rationalize, and Combine: Ensembling Multiple QA Strategies for Zero-shot Knowledge-based VQA	Jun 18, 2024	Question Answering, Visual Question Answering	Code Available	0
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model	Jun 17, 2024	Language Modeling, Language Modelling	Code Available	1
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture	Jun 16, 2024	Diversity, Multiple-choice	Code Available	1
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models	Jun 16, 2024	Hallucination, Hallucination Evaluation	Code Available	3
Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model	Jun 15, 2024	Question Answering, Video Understanding	Code Available	0
What is the Visual Cognition Gap between Humans and Multimodal LLMs?	Jun 14, 2024	object-detection, Object Detection	Code Available	0
Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models	Jun 14, 2024	Decoder, Knowledge Graphs	Unverified	0
Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps	Jun 14, 2024	Question Answering, Visual Question Answering	Code Available	1
Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns	Jun 13, 2024	Autonomous Driving, Question Answering	Unverified	0
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks	Jun 12, 2024	Image Generation, Language Modeling	Code Available	5
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	Jun 11, 2024	Multiple-choice, Question Answering	Code Available	5
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark	Jun 10, 2024	Diversity, Question Answering	Unverified	0
Composition Vision-Language Understanding via Segment and Depth Anything Model	Jun 7, 2024	Question Answering, Visual Question Answering (VQA)	Code Available	0
Understanding Information Storage and Transfer in Multi-modal Large Language Models	Jun 6, 2024	Factual Visual Question Answering, Model Editing	Unverified	0
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following	Jun 4, 2024	Question Answering, Visual Question Answering	Code Available	0
Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering	Jun 4, 2024	Data Augmentation, Machine Translation	Unverified	0
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models	Jun 3, 2024	Image Captioning, Language Modelling	Code Available	2
Selectively Answering Visual Questions	Jun 3, 2024	Avg, In-Context Learning	Unverified	0
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy	Jun 3, 2024	Language Modelling, Question Answering	Code Available	2
Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering	Jun 3, 2024	Diversity, Question Answering	Unverified	0
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models	May 31, 2024	cross-modal alignment, Visual Localization	Code Available	2
Ovis: Structural Embedding Alignment for Multimodal Large Language Model	May 31, 2024	Language Modeling, Multimodal Large Language Model	Code Available	5
VQA Training Sets are Self-play Environments for Generating Few-shot Pools	May 30, 2024	Question Answering, Visual Question Answering	Unverified	0

Datasets: GQA Test2019, VQA v2 test-dev, VQA v2 test-std, OK-VQA, MSVD-QA, DocVQA test, MSRVTT-QA, InfographicVQA, GQA test-dev, VizWiz 2020 VQA, A-OKVQA, CLEVR

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	human	Accuracy	89.3	—	Unverified
2	DREAM+Unicoder-VL (MSRA)	Accuracy	76.04	—	Unverified
3	TRRNet (Ensemble)	Accuracy	74.03	—	Unverified
4	MIL-nbgao	Accuracy	73.81	—	Unverified
5	Kakao Brain	Accuracy	73.33	—	Unverified
6	Coarse-to-Fine Reasoning, Single Model	Accuracy	72.14	—	Unverified
7	270	Accuracy	70.23	—	Unverified
8	NSM ensemble (updated)	Accuracy	67.55	—	Unverified
9	VinVL-DPT	Accuracy	64.92	—	Unverified
10	VinVL+L	Accuracy	64.85	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	Accuracy	84.3	—	Unverified
2	BEiT-3	Accuracy	84.19	—	Unverified
3	VLMo	Accuracy	82.78	—	Unverified
4	ONE-PEACE	Accuracy	82.6	—	Unverified
5	mPLUG (Huge)	Accuracy	82.43	—	Unverified
6	CuMo-7B	Accuracy	82.2	—	Unverified
7	X2-VLM (large)	Accuracy	81.9	—	Unverified
8	MMU	Accuracy	81.26	—	Unverified
9	Lyrics	Accuracy	81.2	—	Unverified
10	InternVL-C	Accuracy	81.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	BEiT-3	overall	84.03	—	Unverified
2	mPLUG-Huge	overall	83.62	—	Unverified
3	ONE-PEACE	overall	82.52	—	Unverified
4	X2-VLM (large)	overall	81.8	—	Unverified
5	VLMo	overall	81.3	—	Unverified
6	SimVLM	overall	80.34	—	Unverified
7	X2-VLM (base)	overall	80.2	—	Unverified
8	VAST	overall	80.19	—	Unverified
9	VALOR	overall	78.62	—	Unverified
10	Prompt Tuning	overall	78.53	—	Unverified