UNITER: UNiversal Image-TExt Representation Learning
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
Code Repositories
- github.com/ChenRocks/UNITER — official, mentioned in paper, PyTorch, ★ 0
- github.com/YIKUAN8/Transformers-VQA — PyTorch, ★ 166
- github.com/necla-ml/SNLI-VE — ★ 120
- github.com/vladsandulescu/hatefulmemes — PyTorch, ★ 13
- github.com/xiaomin418/cfsum — PyTorch, ★ 12
- github.com/lichengunc/pretrain-vl-data — ★ 0
- github.com/SDLZY/VCR_Align — PyTorch, ★ 0
Abstract
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR^2. Code is available at https://github.com/ChenRocks/UNITER.
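The Word-Region Alignment (WRA) objective treats fine-grained alignment as an Optimal Transport problem between word tokens and image regions. Below is a minimal, hypothetical sketch of that idea in PyTorch: a cosine-distance cost matrix between word and region features is solved with plain Sinkhorn iterations, and the resulting transport cost plays the role of a WRA-style loss. Note the paper itself uses the IPOT solver rather than Sinkhorn, and the function names here are illustrative rather than taken from the official repo.

```python
import torch
import torch.nn.functional as F

def cosine_cost(txt_emb, img_emb):
    """Cost matrix between word features (T, d) and region features (R, d)."""
    txt = F.normalize(txt_emb, dim=-1)
    img = F.normalize(img_emb, dim=-1)
    return 1.0 - txt @ img.t()                            # (T, R); small = similar

def sinkhorn_wra(cost, eps=0.1, n_iters=50):
    """Entropy-regularized OT between uniform marginals over words and regions.
    Returns the soft transport plan and the transport cost (a WRA-style loss).
    (Stand-in for the IPOT solver used in the paper.)"""
    T, R = cost.shape
    mu = torch.full((T,), 1.0 / T, device=cost.device)    # word marginal
    nu = torch.full((R,), 1.0 / R, device=cost.device)    # region marginal
    K = torch.exp(-cost / eps)                            # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                              # Sinkhorn scaling updates
        v = nu / (K.t() @ u + 1e-9)
        u = mu / (K @ v + 1e-9)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)            # (T, R) soft alignment
    return plan, (plan * cost).sum()

# Toy usage: 5 word tokens, 8 detected regions, 768-dim Transformer features.
words, regions = torch.randn(5, 768), torch.randn(8, 768)
plan, wra_loss = sinkhorn_wra(cosine_cost(words, regions))
print(plan.shape, wra_loss.item())
```

The entropic regularizer `eps` trades off sharpness of the word-region alignment against numerical stability of the iterations; conditional masking is orthogonal to this loss, since it only governs which modality is masked while the other stays fully observed.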
Tasks
- Visual Question Answering
- Image-Text Retrieval
- Referring Expression Comprehension
- Visual Commonsense Reasoning
- Visual Entailment
- NLVR^2
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| VCR (Q-AR) test | UNITER (Large) | Accuracy | 62.8 | — | Unverified |
| VCR (QA-R) test | UNITER (Large, ensemble of 10) | Accuracy | 83.4 | — | Unverified |
| VCR (QA-R) test | UNITER (Large) | Accuracy | 80.8 | — | Unverified |
| VCR (Q-A) test | UNITER (Large, ensemble of 10) | Accuracy | 79.8 | — | Unverified |
| VCR (Q-A) test | UNITER (Large) | Accuracy | 77.3 | — | Unverified |
| VQA v2 test-dev | UNITER (Large) | Accuracy | 73.24 | — | Unverified |
| VQA v2 test-std | UNITER (Large) | Overall accuracy | 73.4 | — | Unverified |