FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
Code
- github.com/facebookresearch/multimodal (PyTorch) ★ 1,706
- github.com/social-ai-studio/matk (PyTorch) ★ 13
- github.com/apsdehal/flava-tutorials ★ 12
- github.com/2024-MindSpore-1/Code2/tree/main/model-1/falcon (MindSpore) ★ 0
Abstract
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to obtain good performance on a variety of downstream tasks. Such models are typically either cross-modal (contrastive) or multi-modal (with early fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
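The "cross-modal (contrastive)" objective mentioned above aligns image and text embeddings so that matched pairs score higher than mismatched ones. A minimal NumPy sketch of such a symmetric contrastive loss (illustrative only; the function name, temperature value, and implementation details are assumptions, not the paper's code):

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    (Illustrative sketch, not FLAVA's actual implementation.)
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N): matched pairs on the diagonal
    labels = np.arange(len(img))

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

When the two embedding sets are identical, the diagonal dominates and the loss is low; shuffling one side breaks the pairing and raises it.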
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| COCO (Common Objects in Context) | FLAVA (zero-shot) | recall@1 | 38.38 | — | Unverified |
| COCO (Common Objects in Context) | CLIP (zero-shot) | recall@1 | 33.29 | — | Unverified |
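The recall@1 numbers above measure how often the top-ranked retrieval candidate is the correct match. A small sketch of how the metric is computed from a query-candidate similarity matrix (illustrative; not the paper's evaluation code, and it assumes candidate i is the ground-truth match for query i):

```python
import numpy as np

def recall_at_1(sim):
    """Fraction of queries whose highest-scoring candidate is the true match.

    sim: (N, N) array, sim[i, j] = similarity of query i to candidate j;
    ground truth pairs lie on the diagonal. (Illustrative sketch.)
    """
    top1 = sim.argmax(axis=1)                     # best candidate per query
    return float((top1 == np.arange(sim.shape[0])).mean())
```

For example, a similarity matrix with the largest value of each row on the diagonal yields a recall@1 of 1.0.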