Deep Modular Co-Attention Networks for Visual Question Answering
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, Qi Tian
Code
- github.com/MILVLG/mcan-vqa (official, in paper, PyTorch) ★ 0
- github.com/ThanThoai/Visual-Question-Answering_Vietnamese (PyTorch) ★ 8
- github.com/hieunghia-pat/UIT-MCAN (PyTorch) ★ 2
- github.com/vikrantmane7781/detectroon2 (PyTorch) ★ 0
- github.com/apugoneappu/vqa_visualise (PyTorch) ★ 0
- github.com/apugoneappu/ask_me_anything (PyTorch) ★ 0
- github.com/straightAYiJun/vqa-attention-visualize-system (PyTorch) ★ 0
Abstract
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective "co-attention" model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the question-guided attention of images, jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state of the art. Our best single model delivers 70.63% overall accuracy on the test-dev set. Code is available at https://github.com/MILVLG/mcan-vqa.
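The abstract describes each MCA layer as a modular composition of two basic attention units: self-attention (SA) within each modality and guided attention (GA) from the question to the image, with layers cascaded in depth. Below is a minimal PyTorch sketch of one such composition. The hidden size, head count, depth, and class names (`AttentionUnit`, `MCALayer`) are illustrative assumptions for one stacked SA-then-GA wiring, not the authors' released implementation, which lives at github.com/MILVLG/mcan-vqa.

```python
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    """One basic attention unit: multi-head attention plus a feed-forward
    network, each with a residual connection and layer norm. Acts as
    self-attention (SA) when x == y, and as guided attention (GA) when
    y supplies the keys/values (e.g. question-guided image attention)."""
    def __init__(self, dim=512, heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                         batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, y):
        att, _ = self.mha(x, y, y)  # queries from x, keys/values from y
        x = self.norm1(x + self.drop(att))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x

class MCALayer(nn.Module):
    """One Modular Co-Attention layer: SA over the question, SA over the
    image, then question-guided GA over the image."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.sa_q = AttentionUnit(dim, heads)
        self.sa_v = AttentionUnit(dim, heads)
        self.ga_v = AttentionUnit(dim, heads)

    def forward(self, q, v):
        q = self.sa_q(q, q)  # question self-attention
        v = self.sa_v(v, v)  # image self-attention
        v = self.ga_v(v, q)  # image attends to the question (GA)
        return q, v

# Cascade MCA layers in depth (6 is an assumed depth for illustration).
layers = nn.ModuleList([MCALayer() for _ in range(6)])
q = torch.randn(2, 14, 512)   # (batch, question tokens, dim)
v = torch.randn(2, 100, 512)  # (batch, image regions, dim)
for layer in layers:
    q, v = layer(q, v)
```

The attended question and image features would then be fused and fed to a classifier over candidate answers; that head, like the wiring above, is a design choice the paper ablates rather than a single fixed recipe.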
Tasks
- Visual Question Answering
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| SQA3D | MCAN | AnswerExactMatch (Question Answering) | 43.42 | — | Unverified |