Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

2022-08-22

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei

Code

github.com/microsoft/unilm/tree/master/beit
Official, PyTorch, ★ 0
github.com/lyan62/data-curation
PyTorch, ★ 8

Abstract

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
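
The unified masked "language" modeling described in the abstract can be illustrated with a minimal sketch. It assumes images have already been converted to discrete visual tokens, so that one masking routine applies to images, texts, and image-text pairs alike. The function name `mask_tokens`, the token values, and the 0.4 mask ratio are hypothetical placeholders chosen for illustration, not taken from the paper or the official code.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption, not the official token)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token the model would have to recover.
    """
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK
    return corrupted, targets


rng = random.Random(0)

# Hypothetical inputs: visual token ids standing in for image patches
# ("Imglish") and ordinary word tokens (English).
image_tokens = ["img_17", "img_402", "img_88", "img_9", "img_230", "img_51"]
text_tokens = ["a", "dog", "on", "the", "grass"]

examples = {
    "image": image_tokens,
    "text": text_tokens,
    # An image-text pair is treated as one "parallel sentence".
    "pair": image_tokens + text_tokens,
}

# The same routine handles all three input types.
for name, sequence in examples.items():
    corrupted, targets = mask_tokens(sequence, 0.4, rng)
    print(name, corrupted, targets)
```

In this sketch the pretraining objective would be to predict the entries of `targets` from `corrupted`; the model itself (the Multiway Transformer) is deliberately left out.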

Tasks

Cross-Modal Retrieval, Image Captioning, Image Classification, Instance Segmentation, Language Modeling, Masked Language Modeling, Object Detection, Question Answering, Retrieval, Semantic Segmentation, Visual Question Answering (VQA), Visual Reasoning, Zero-Shot Cross-Modal Retrieval