Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut
Code
- github.com/google-research-datasets/conceptual-12m (official, referenced in the paper; framework: none) ★ 421
- github.com/facebookresearch/meru (PyTorch) ★ 200
- github.com/gicheonkang/gst-visdial (PyTorch) ★ 20
Abstract
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M (CC12M), a dataset of 12 million image-text pairs specifically meant for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
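Like CC3M, the dataset is released as a list of image-URL/caption pairs (a tab-separated file in the official repository) rather than the images themselves. Below is a minimal, illustrative sketch of how such a TSV could be read and a small sample of images fetched; the filename `cc12m.tsv`, the output directory, and the helper names are assumptions for illustration, not part of the official release tooling.

```python
# Minimal sketch for sampling CC12M-style (image URL, caption) pairs.
# Assumptions: the filename "cc12m.tsv", the output directory, and the
# helper names are placeholders; only the URL<TAB>caption row format is
# taken from the official release.
import csv
import os
import urllib.request


def iter_pairs(tsv_path):
    """Yield (image_url, caption) tuples from a tab-separated file."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                yield row[0], row[1]


def fetch_sample(tsv_path, out_dir, limit=10):
    """Download a small sample of images, skipping URLs that fail."""
    os.makedirs(out_dir, exist_ok=True)
    captions = {}
    for i, (url, caption) in enumerate(iter_pairs(tsv_path)):
        if len(captions) >= limit:
            break
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                data = resp.read()
        except Exception:
            continue  # dead or blocked links are common at web scale
        path = os.path.join(out_dir, f"{i:08d}.jpg")
        with open(path, "wb") as img_file:
            img_file.write(data)
        captions[path] = caption
    return captions


if __name__ == "__main__":
    for path, caption in fetch_sample("cc12m.tsv", "cc12m_sample").items():
        print(path, "->", caption)
```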
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| nocaps-val-in-domain | Enc-Dec | CIDEr | 92.6 | — | Unverified |
| nocaps-val-near-domain | Enc-Dec | CIDEr | 88.3 | — | Unverified |
| nocaps-val-out-domain | Enc-Dec | CIDEr | 94.5 | — | Unverified |
| nocaps-val-overall | Enc-Dec | CIDEr | 90.2 | — | Unverified |
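The metric above is CIDEr, which scores a generated caption by its TF-IDF-weighted n-gram consensus with the reference captions. The toy sketch below shows how CIDEr can be computed with the commonly used `pycocoevalcap` package; the image IDs and captions are invented, and a real nocaps evaluation would run over the full validation split with its reference captions (papers conventionally report the raw score multiplied by 100).

```python
# Toy CIDEr scoring sketch using pycocoevalcap (pip install pycocoevalcap).
# The IDs and captions are made up for illustration only.
from pycocoevalcap.cider.cider import Cider

# References: image id -> list of ground-truth captions.
gts = {
    "img_0": ["a brown dog runs across a grassy field",
              "a dog running through the grass"],
    "img_1": ["two people ride bicycles down a city street",
              "cyclists riding on a road in the city"],
}

# Hypotheses: image id -> generated caption, as a one-element list.
res = {
    "img_0": ["a dog runs through a field of grass"],
    "img_1": ["two people riding bikes on a street"],
}

corpus_score, per_image_scores = Cider().compute_score(gts, res)
# Papers conventionally report CIDEr multiplied by 100.
print(f"CIDEr: {100 * corpus_score:.1f}")
```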