
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

2022-04-07 · CVPR 2022 · Code Available

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross

Abstract

We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
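The matching task described above can be sketched in code. The snippet below is a minimal illustration, not the authors' evaluation script. It assumes the commonly described rule for the text score: an example counts as correct only if, for each of the two images, the model scores the matching caption strictly higher than the other caption. The scoring rule, the `Example` layout, the two toy scorers, and the caption pair are all assumptions made for illustration; none of them appear on this page.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout: (caption_0, image_0, caption_1, image_1).
# Images are stood in for by strings so the sketch runs without a real model.
Example = Tuple[str, str, str, str]
Scorer = Callable[[str, str], float]  # (caption, image) -> similarity


def text_correct(score: Scorer, c0: str, i0: str, c1: str, i1: str) -> bool:
    """Assumed rule: for each image, the matching caption must win strictly."""
    return score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)


def text_score(score: Scorer, examples: Sequence[Example]) -> float:
    """Percentage of examples where both images pick the right caption."""
    hits = sum(text_correct(score, *ex) for ex in examples)
    return 100.0 * hits / len(examples)


def order_aware(caption: str, image: str) -> float:
    """Toy scorer that is sensitive to word order."""
    return 1.0 if image == "IMG:" + caption else 0.0


def bag_of_words(caption: str, image: str) -> float:
    """Toy scorer that ignores word order entirely."""
    described = set(image[len("IMG:"):].split())
    return float(len(set(caption.split()) & described))


# Made-up caption pair with an identical set of words in a different order.
EXAMPLES = [
    ("a dog chases a cat", "IMG:a dog chases a cat",
     "a cat chases a dog", "IMG:a cat chases a dog"),
]

print(text_score(order_aware, EXAMPLES))
print(text_score(bag_of_words, EXAMPLES))
```

Because both captions contain the same words, the order-blind scorer assigns them identical scores for either image, so the strict comparison fails. This is the property the dataset is built to test.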

Benchmark Results

Dataset     Model                      Metric      Claimed  Verified  Status
Winoground  UNITER large               Text Score  38       -         Unverified
Winoground  VinVL                      Text Score  37.75    -         Unverified
Winoground  ViLLA large                Text Score  37       -         Unverified
Winoground  ViLT (ViT-B/32)            Text Score  34.75    -         Unverified
Winoground  FLAVA (ITM)                Text Score  32.25    -         Unverified
Winoground  UNITER base                Text Score  32.25    -         Unverified
Winoground  CLIP (ViT-B/32)            Text Score  30.75    -         Unverified
Winoground  ViLLA base                 Text Score  30       -         Unverified
Winoground  FLAVA (contrastive)        Text Score  25.25    -         Unverified
Winoground  Random chance              Text Score  25       -         Unverified
Winoground  ViLBERT base               Text Score  23.75    -         Unverified
Winoground  VSE++ (COCO, ResNet)       Text Score  22.75    -         Unverified
Winoground  VSRN (Flickr30k)           Text Score  20       -         Unverified
Winoground  VSE++ (Flickr30k, ResNet)  Text Score  20       -         Unverified
Winoground  VSE++ (Flickr30k, VGG)     Text Score  19.75    -         Unverified
Winoground  UniT (ITM finetuned)       Text Score  19.5     -         Unverified
Winoground  LXMERT                     Text Score  19.25    -         Unverified
Winoground  VSE++ (COCO, VGG)          Text Score  18.75    -         Unverified
Winoground  VSRN (COCO)                Text Score  17.5     -         Unverified
Winoground  VisualBERT base            Text Score  15.5     -         Unverified
