LXMERT: Learning Cross-Modality Encoder Representations from Transformers

2019-08-20 · IJCNLP 2019 · Code Available

Hao Tan, Mohit Bansal


Code

github.com/airsplay/lxmert (Official, In paper; pytorch, ★ 0)
github.com/zhegan27/VILLA (pytorch, ★ 119)
github.com/zhegan27/LXMERT-AdvTrain (pytorch, ★ 21)
github.com/social-ai-studio/matk (pytorch, ★ 13)
github.com/ghazaleh-mahmoodi/lxmert_compression (pytorch, ★ 5)
github.com/itsShnik/adaptively-finetuning-transformers (pytorch, ★ 0)
github.com/Mind23-2/MindCode-156 (mindspore, ★ 0)
github.com/chaitanyadwivedii/3D-Attention-is-All-You-Need (pytorch, ★ 0)

Abstract

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pre-trained parameters, our model achieves the state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pre-trained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel model components and pre-training strategies significantly contribute to our strong results; and also present several attention visualizations for the different encoders. Code and pre-trained models publicly available at: https://github.com/airsplay/lxmert
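
The abstract names a cross-modality encoder but does not spell out its mechanics. As a rough illustration only, the sketch below shows single-head scaled dot-product cross-attention, the generic Transformer operation by which features from one modality can attend over features from the other. It is not the authors' implementation: there are no learned projections, layer norms, or feed-forward sublayers, and the names `cross_attention`, `word_feats`, and `object_feats`, along with the toy values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query vector attends over all context vectors.

    Simplifying assumption: queries and context share the same dimension,
    and no learned query/key/value projections are applied.
    Returns (outputs, attention_weights).
    """
    dim = len(context[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every context vector.
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        # Output is the attention-weighted average of the context vectors.
        out = [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy features (hypothetical values): 3 word vectors and 2 object vectors.
word_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_feats = [[2.0, 0.0], [0.0, 2.0]]

# Language attends to vision, and vision attends to language.
lang_out, lang_w = cross_attention(word_feats, object_feats)
vis_out, vis_w = cross_attention(object_feats, word_feats)
```

Running attention in both directions gives each word a summary of the detected objects and each object a summary of the sentence, which is the kind of alignment between modalities that the abstract describes.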

Tasks

Language Modeling, Language Modelling, Masked Language Modeling, Question Answering, Sentence, Visual Question Answering, Visual Question Answering (VQA), Visual Reasoning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
A-OKVQA	LXMERT	MC Accuracy	41.6	—	Unverified
GQA Test2019	LXR955, Ensemble	Accuracy	62.71	—	Unverified
GQA Test2019	LXR955, Single Model	Accuracy	60.33	—	Unverified
GQA test-dev	LXMERT (Pre-train + scratch)	Accuracy	60	—	Unverified
GQA test-std	LXMERT	Accuracy	60.3	—	Unverified
VizWiz 2018	LXR955, No Ensemble	overall	55.4	—	Unverified
VQA v2 test-dev	LXMERT (Pre-train + scratch)	Accuracy	69.9	—	Unverified
VQA v2 test-std	LXMERT	overall	72.5	—	Unverified
