Bilinear Attention Networks

2018-05-21 · NeurIPS 2018 · Code Available

Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang

Code

github.com/jnhwkim/ban-vqa (official, PyTorch, ★ 549)
github.com/facebookresearch/pythia (PyTorch, ★ 5,626)
github.com/ronghanghu/pythia (PyTorch, ★ 45)
github.com/Cyanogenoid/vqa-counting (PyTorch, ★ 0)
github.com/allenai/pythia (PyTorch, ★ 0)
github.com/jackroos/pythia (PyTorch, ★ 0)

Abstract

Attention networks in multimodal learning provide an efficient way to use given visual information selectively. However, learning attention distributions for every pair of multimodal input channels is prohibitively expensive. To sidestep this cost, co-attention builds two separate attention distributions, one per modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN), which find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representation for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves a new state of the art on both datasets.
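
The core idea in the abstract, one attention distribution over every pair of channels combined with low-rank bilinear pooling, can be illustrated with a small NumPy sketch. This is a simplified illustration and not the authors' implementation: the function names, the shapes, the weight vector `p`, and the choice to share the projections `U` and `V` between the attention map and the pooled feature are all assumptions made here, and nonlinearities, biases, and multiple attention maps are omitted.

```python
import numpy as np


def softmax_over_map(logits):
    """Softmax over every entry of a 2-D array, so the whole map sums to 1."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()


def bilinear_attention(X, Y, U, V, p):
    """Hypothetical single-map bilinear attention (illustrative names only).

    X: (n_x, d_x) channels of one modality, e.g. question tokens
    Y: (n_y, d_y) channels of the other modality, e.g. image regions
    U: (d_x, k), V: (d_y, k) low-rank projections (assumed shared here)
    p: (k,) weights that score each pair of channels
    """
    XU = X @ U  # (n_x, k) projected channels of the first modality
    YV = Y @ V  # (n_y, k) projected channels of the second modality

    # One logit per pair (i, j): sum_k p_k * XU[i, k] * YV[j, k].
    logits = (XU * p) @ YV.T  # (n_x, n_y)
    A = softmax_over_map(logits)

    # Low-rank bilinear pooling weighted by the map:
    # f[k] = sum_i sum_j A[i, j] * XU[i, k] * YV[j, k].
    f = np.einsum("ij,ik,jk->k", A, XU, YV)
    return A, f


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))  # 3 question tokens, 6 features each
Y = rng.normal(size=(5, 8))  # 5 image regions, 8 features each
U = rng.normal(size=(6, 4))  # rank k = 4
V = rng.normal(size=(8, 4))
p = rng.normal(size=4)

A, f = bilinear_attention(X, Y, U, V, p)
print(A.shape, f.shape)  # (3, 5) (4,)
```

In contrast to co-attention, which would produce one distribution of length 3 and another of length 5, the map `A` here holds a weight for each of the 15 token-region pairs, while the rank-`k` projections keep the cost of scoring each pair low.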

Tasks

Visual Question Answering (VQA)

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Flickr30k Entities Test	BAN (Bottom-Up detector)	R@1	69.69	—	Unverified
