FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
Code
- github.com/facebookresearch/multimodal (PyTorch) ★ 1,706
- github.com/social-ai-studio/matk (PyTorch) ★ 13
- github.com/apsdehal/flava-tutorials ★ 12
- github.com/2024-MindSpore-1/Code2/tree/main/model-1/falcon (MindSpore) ★ 0
Abstract
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to obtain good performance on a variety of downstream tasks. Such models are typically either cross-modal (contrastive) or multi-modal (with early fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
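The "cross-modal (contrastive)" objective mentioned above aligns image and text embeddings so that matched pairs score higher than mismatched ones. A minimal NumPy sketch of such a symmetric contrastive loss (illustrative only; the function name, temperature value, and implementation details are assumptions, not the paper's code):

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    (Illustrative sketch, not FLAVA's actual implementation.)
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N): matched pairs on the diagonal
    labels = np.arange(len(img))

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

When the two embedding sets are identical, the diagonal dominates and the loss is low; shuffling one side breaks the pairing and raises it.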
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| COCO (Common Objects in Context) | FLAVA (zero-shot) | recall@1 | 38.38 | — | Unverified |
| COCO (Common Objects in Context) | CLIP (zero-shot) | recall@1 | 33.29 | — | Unverified |
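The recall@1 numbers above measure how often the top-ranked retrieval candidate is the correct match. A small sketch of how the metric is computed from a query-candidate similarity matrix (illustrative; not the paper's evaluation code, and it assumes candidate i is the ground-truth match for query i):

```python
import numpy as np

def recall_at_1(sim):
    """Fraction of queries whose highest-scoring candidate is the true match.

    sim: (N, N) array, sim[i, j] = similarity of query i to candidate j;
    ground truth pairs lie on the diagonal. (Illustrative sketch.)
    """
    top1 = sim.argmax(axis=1)                     # best candidate per query
    return float((top1 == np.arange(sim.shape[0])).mean())
```

For example, a similarity matrix with the largest value of each row on the diagonal yields a recall@1 of 1.0.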