
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

2022-11-22 · Code Available

Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou


Abstract

Vision-language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments; some others utilize pre-trained object detectors to leverage vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments with a unified pre-training framework that learns multi-grained alignment and multi-grained localization simultaneously. Based on this framework, we present X^2-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X^2-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X^2-VLM performs best at both base and large scales on both image-text and video-text tasks, striking a good trade-off between performance and model scale. Moreover, we show that the modular design of X^2-VLM gives it high transferability, allowing it to be used in any language or domain. For example, by simply replacing the text encoder with XLM-R, X^2-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
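The key architectural claim in the abstract is the modular design: separate vision, text, and fusion modules that can be recombined, so the text encoder can be swapped for XLM-R to obtain a multilingual model without multilingual pre-training. The sketch below illustrates that composition in PyTorch under stated assumptions: `ModularVLM`, `ToyVisionEncoder`, and the single cross-attention fusion layer are hypothetical stand-ins for illustration, not the actual X^2-VLM implementation; only `XLMRobertaModel` and `XLMRobertaTokenizer` from Hugging Face `transformers` are real APIs. See the linked repository for the real code.

```python
# Minimal sketch (not the actual X^2-VLM code) of a modular vision-language
# model: vision encoder, swappable text encoder, and a fusion block.
import torch
import torch.nn as nn
from transformers import XLMRobertaModel, XLMRobertaTokenizer


class ToyVisionEncoder(nn.Module):
    """Hypothetical patch embedder standing in for a pre-trained ViT-style backbone."""
    def __init__(self, patch_size: int = 16, dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values):                    # (B, 3, H, W)
        x = self.proj(pixel_values)                     # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)             # (B, Nv, D)


class ModularVLM(nn.Module):
    """Hypothetical wrapper: any text encoder with a matching hidden size plugs in."""
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module, dim: int = 768):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder                # e.g. BERT, or XLM-R for multilingual use
        # Single cross-attention layer standing in for the real fusion encoder.
        self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, pixel_values, input_ids, attention_mask):
        vis_tokens = self.vision_encoder(pixel_values)                      # (B, Nv, D)
        txt_tokens = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                                 # (B, Nt, D)
        fused, _ = self.fusion(txt_tokens, vis_tokens, vis_tokens)          # text attends to image
        return fused


# Swapping in XLM-R as the text encoder, as the abstract describes for multilingual transfer.
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = ModularVLM(ToyVisionEncoder(), XLMRobertaModel.from_pretrained("xlm-roberta-base"))

batch = tokenizer(["a dog chasing a ball"], return_tensors="pt", padding=True)
images = torch.randn(1, 3, 224, 224)                    # dummy image batch
features = model(images, batch["input_ids"], batch["attention_mask"])
print(features.shape)                                   # torch.Size([1, Nt, 768])
```

Because only the text module changes, the vision backbone and fusion weights are reused as-is; this is the mechanism behind the cross-lingual transfer result quoted in the abstract.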


Benchmark Results

Dataset     Model           Metric              Claimed   Verified   Status
COCO 2014   X2-VLM (base)   Text-to-image R@1   66.2      -          Unverified
COCO 2014   X2-VLM (large)  Text-to-image R@1   67.7      -          Unverified
Flickr30k   X2-VLM (large)  Image-to-text R@1   98.8      -          Unverified
Flickr30k   X2-VLM (base)   Image-to-text R@1   98.5      -          Unverified

Reproductions