Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

2022-10-17

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao

Code

github.com/computer-vision-in-the-wild/cvinw_readings
Official (in paper) · ★ 1,363

Abstract

This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: (i) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; (ii) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and (iii) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.

Tasks

Few-Shot Learning, Image Captioning, Image Classification, Image-text Retrieval, Object Detection, Question Answering, Retrieval, Text Retrieval, Video Captioning, Video Question Answering, Video-Text Retrieval, Visual Grounding, Visual Question Answering (VQA)
