SOTAVerified

Visual Question Answering (VQA)

Visual Question Answering (VQA) is a computer vision task in which a system answers natural-language questions about an image. The goal is to teach machines to understand an image's content well enough to answer such questions in natural language.

Image Source: visualqa.org

Papers

Showing 1151–1200 of 2167 papers

Title | Status | Hype
QSAN: A Near-term Achievable Quantum Self-Attention Network | - | 0
Multiview Contrastive Learning for Completely Blind Video Quality Assessment of User Generated Content | Code | 0
Subjective and Objective Quality Assessment of High-Motion Sports Videos at Low-Bitrates | Code | 0
Video Graph Transformer for Video Question Answering | Code | 1
ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities | Code | 1
Exploring the Effectiveness of Video Perceptual Representation in Blind Video Quality Assessment | Code | 0
OVQA: A Clinically Generated Visual Question Answering Dataset | - | 0
Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning | Code | 0
Weakly Supervised Grounding for VQA in Vision-Language Transformers | Code | 1
VGNMN: Video-grounded Neural Module Networks for Video-Grounded Dialogue Systems | - | 0
American == White in Multimodal Language-and-Image AI | - | 0
Modern Question Answering Datasets and Benchmarks: A Survey | - | 0
A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA | Code | 1
EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering | - | 0
Consistency-preserving Visual Question Answering in Medical Imaging | Code | 1
From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering | - | 0
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer | Code | 1
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives | Code | 0
Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer Grounding | - | 0
Grounding Answers for Visual Questions Asked by Visually Impaired People | - | 0
DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment | Code | 0
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | - | 0
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | Code | 1
MixGen: A New Multi-Modal Data Augmentation | Code | 1
Test-Time Adaptation for Visual Document Understanding | - | 0
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | Code | 1
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning | Code | 2
Language Models are General-Purpose Interfaces | - | 0
GLIPv2: Unifying Localization and Vision-Language Understanding | Code | 4
Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model | - | 0
cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation | Code | 0
From Pixels to Objects: Cubic Visual Attention for Visual Question Answering | - | 0
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge | Code | 1
Structured Two-stream Attention Network for Video Question Answering | - | 0
VL-BEiT: Generative Vision-Language Pretraining | - | 0
REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | Code | 1
Question Modifiers in Visual Question Answering | - | 0
Fine-tuning vs From Scratch: Do Vision & Language Models Have Similar Capabilities on Out-of-Distribution Visual Question Answering? | - | 0
Un jeu de données pour répondre à des questions visuelles à propos d’entités nommées en utilisant des bases de connaissances (ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities) | - | 0
An Efficient Modern Baseline for FloodNet VQA | Code | 0
Visual Superordinate Abstraction for Robust Concept Learning | - | 0
GIT: A Generative Image-to-text Transformer for Vision and Language | Code | 2
V-Doc : Visual questions answers with Documents | - | 0
Avoiding Barren Plateaus with Classical Deep Neural Networks | - | 0
Guiding Visual Question Answering with Attention Priors | - | 0
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Code | 1
Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization | - | 0
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization | - | 0
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering | - | 0
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | human | Accuracy | 89.3 | - | Unverified
2 | DREAM+Unicoder-VL (MSRA) | Accuracy | 76.04 | - | Unverified
3 | TRRNet (Ensemble) | Accuracy | 74.03 | - | Unverified
4 | MIL-nbgao | Accuracy | 73.81 | - | Unverified
5 | Kakao Brain | Accuracy | 73.33 | - | Unverified
6 | Coarse-to-Fine Reasoning, Single Model | Accuracy | 72.14 | - | Unverified
7 | 270 | Accuracy | 70.23 | - | Unverified
8 | NSM ensemble (updated) | Accuracy | 67.55 | - | Unverified
9 | VinVL-DPT | Accuracy | 64.92 | - | Unverified
10 | VinVL+L | Accuracy | 64.85 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | PaLI | Accuracy | 84.3 | - | Unverified
2 | BEiT-3 | Accuracy | 84.19 | - | Unverified
3 | VLMo | Accuracy | 82.78 | - | Unverified
4 | ONE-PEACE | Accuracy | 82.6 | - | Unverified
5 | mPLUG (Huge) | Accuracy | 82.43 | - | Unverified
6 | CuMo-7B | Accuracy | 82.2 | - | Unverified
7 | X2-VLM (large) | Accuracy | 81.9 | - | Unverified
8 | MMU | Accuracy | 81.26 | - | Unverified
9 | Lyrics | Accuracy | 81.2 | - | Unverified
10 | InternVL-C | Accuracy | 81.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | BEiT-3 | overall | 84.03 | - | Unverified
2 | mPLUG-Huge | overall | 83.62 | - | Unverified
3 | ONE-PEACE | overall | 82.52 | - | Unverified
4 | X2-VLM (large) | overall | 81.8 | - | Unverified
5 | VLMo | overall | 81.3 | - | Unverified
6 | SimVLM | overall | 80.34 | - | Unverified
7 | X2-VLM (base) | overall | 80.2 | - | Unverified
8 | VAST | overall | 80.19 | - | Unverified
9 | VALOR | overall | 78.62 | - | Unverified
10 | Prompt Tuning | overall | 78.53 | - | Unverified