SOTAVerified

Measuring Progress in Fine-grained Vision-and-Language Understanding

2023-05-12 · Code Available

Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh


Abstract

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Winoground | X-VLM 16M | Text Score | 46.7 | | Unverified |
| Winoground | X-VLM 4M | Text Score | 44 | | Unverified |
| Winoground | BLIP 14M | Text Score | 36.5 | | Unverified |
| Winoground | BLIP 129M | Text Score | 35.5 | | Unverified |
| Winoground | BLIP 129M (CapFilt/L) | Text Score | 34.7 | | Unverified |
| Winoground | BLIP-ViT/L 129M | Text Score | 34.7 | | Unverified |
| Winoground | PEVL 14M | Text Score | 33.2 | | Unverified |
| Winoground | ALBEF 14M | Text Score | 32.5 | | Unverified |
| Winoground | ALBEF 4M | Text Score | 29.2 | | Unverified |
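The "Text Score" column follows Winoground's standard metric: each example pairs two images with two captions, and an example counts as correct only when the model prefers the right caption for both images. A minimal sketch of that computation, where `score` is a hypothetical stand-in for any model's image-text matching score:

```python
def text_score(examples, score):
    """Winoground text score: fraction of examples where the model
    assigns the matching caption a higher score for BOTH images.

    `examples` is a list of (c0, i0, c1, i1) tuples, where caption
    c0 matches image i0 and caption c1 matches image i1.
    `score(caption, image)` is any image-text matching function.
    """
    correct = 0
    for c0, i0, c1, i1 in examples:
        # Both comparisons must favour the ground-truth pairing.
        if score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1):
            correct += 1
    return correct / len(examples)


if __name__ == "__main__":
    # Toy scores standing in for a model's output.
    table = {
        ("c0", "i0"): 0.9, ("c1", "i0"): 0.1,
        ("c1", "i1"): 0.8, ("c0", "i1"): 0.2,
    }
    acc = text_score([("c0", "i0", "c1", "i1")], lambda c, i: table[(c, i)])
    print(acc)  # 1.0: the single example is scored correctly
```

The reported numbers are percentages of examples passing this test, so X-VLM 16M's 46.7 means the correct caption was preferred for both images in roughly 47% of Winoground examples.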
