VQA: Visual Question Answering
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh
Code Available
- github.com/vipulgupta1011/swapmix (pytorch, ★ 20)
- github.com/mkhalil1998/EC601_Group_Project (pytorch, ★ 2)
- github.com/mokhalid-dev/Attention-based-VQA-model (pytorch, ★ 0)
- github.com/ramprs/grad-cam (torch, ★ 0)
- github.com/yanxinyan1/yxy (pytorch, ★ 0)
- github.com/moh833/VQA (none, ★ 0)
- github.com/SatyamGaba/vqa (pytorch, ★ 0)
- github.com/SatyamGaba/visual_question_answering (pytorch, ★ 0)
- github.com/tbmoon/basic_vqa (pytorch, ★ 0)
- github.com/SuchismitaSahu1993/VQA-System (none, ★ 0)
Abstract
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
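The abstract notes that VQA is amenable to automatic evaluation because open-ended answers are short. In the paper's evaluation, each question comes with ten human answers, and a predicted answer is counted as fully correct when at least three annotators gave it. A minimal sketch of that consensus metric (the function name and the light lowercase/whitespace normalization are ours; the official evaluation script applies more elaborate processing of punctuation, articles, and number words):

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy for open-ended VQA.

    An answer scores min(#matching human answers / 3, 1), so it is
    fully correct if at least 3 of the (typically 10) human answers
    agree with it. Matching here is simple normalized string equality.
    """
    norm = lambda a: a.strip().lower()
    matches = sum(1 for a in human_answers if norm(a) == norm(predicted))
    return min(matches / 3.0, 1.0)
```

Averaging this per-question score over a test split yields the "Percentage correct" numbers reported in the benchmark table below.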
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| COCO Visual Question Answering (VQA) abstract images 1.0 multiple choice | Dualnet ensemble | Percentage correct | 71.18 | — | Unverified |
| COCO Visual Question Answering (VQA) abstract images 1.0 multiple choice | LSTM + global features | Percentage correct | 69.21 | — | Unverified |
| COCO Visual Question Answering (VQA) abstract images 1.0 multiple choice | LSTM blind | Percentage correct | 61.41 | — | Unverified |
| COCO Visual Question Answering (VQA) abstract images 1.0 open ended | Dualnet ensemble | Percentage correct | 69.73 | — | Unverified |
| COCO Visual Question Answering (VQA) abstract images 1.0 open ended | LSTM + global features | Percentage correct | 65.02 | — | Unverified |
| COCO Visual Question Answering (VQA) abstract images 1.0 open ended | LSTM blind | Percentage correct | 57.19 | — | Unverified |
| COCO Visual Question Answering (VQA) real images 1.0 multiple choice | LSTM Q+I | Percentage correct | 63.1 | — | Unverified |
| COCO Visual Question Answering (VQA) real images 1.0 open ended | LSTM Q+I | Percentage correct | 58.2 | — | Unverified |
| COCO Visual Question Answering (VQA) real images 2.0 open ended | HDU-USYD-UNCC | Percentage correct | 68.16 | — | Unverified |
| COCO Visual Question Answering (VQA) real images 2.0 open ended | DLAIT | Percentage correct | 68.07 | — | Unverified |