What You See is What You Read? Improving Text-Image Alignment Evaluation

2023-05-17 · NeurIPS 2023 · Code Available

Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor


Code

github.com/yonatanbitton/wysiwyr
Official · PyTorch · ★ 37

Abstract

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
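
The re-ranking use case mentioned at the end of the abstract can be illustrated with a minimal Python sketch. It assumes some alignment scorer is available as a callable that maps a (text, image) pair to a number where higher means better aligned. The names `rerank_by_alignment`, `score_fn`, and `toy_score` are hypothetical and do not come from the paper or its repository; the toy scorer only counts overlapping words and stands in for a real model.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def rerank_by_alignment(
    prompt: str,
    candidates: Sequence[Image],
    score_fn: Callable[[str, Image], float],
) -> List[Tuple[Image, float]]:
    """Score each candidate against the prompt and sort best-first.

    `score_fn` is a placeholder for any text-image alignment model
    (hypothetical interface, not the authors' API).
    """
    scored = [(image, score_fn(prompt, image)) for image in candidates]
    # sorted() is stable, so ties keep their original generation order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(prompt: str, image_tags: set) -> float:
    """Toy stand-in scorer: an "image" is just a set of tags, and the
    score is the fraction of distinct prompt words found among them."""
    words = set(prompt.lower().split())
    return len(words & image_tags) / len(words)


candidates = [
    {"red", "cube"},
    {"red", "cube", "blue", "sphere"},
    {"dog"},
]
ranked = rerank_by_alignment("a red cube on a blue sphere", candidates, toy_score)
best_image, best_score = ranked[0]
```

In practice `score_fn` would wrap a learned alignment model, and only the top-ranked candidate (or the top few) would be kept.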

Tasks

Image Generation, Image to Text, Question Answering, Question Generation, Text Generation, Text-to-Image Generation, Visual Question Answering, Visual Reasoning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Winoground	VQ2	Text Score	47	—	Unverified
Winoground	PaLI (ft SNLI-VE + Synthetic Data)	Text Score	46.5	—	Unverified
Winoground	PaLI (ft SNLI-VE)	Text Score	45	—	Unverified
Winoground	BLIP2 (ft COCO)	Text Score	44	—	Unverified
Winoground	COCA ViT-L14 (ft COCO)	Text Score	28.25	—	Unverified
Winoground	OFA large (ft SNLI-VE)	Text Score	27.7	—	Unverified
Winoground	CLIP RN50x64	Text Score	26.5	—	Unverified
Winoground	TIFA	Text Score	19	—	Unverified
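
For readers unfamiliar with the metric, the sketch below shows how a Winoground-style Text Score is commonly computed. Each example pairs two captions (c0, c1) with two images (i0, i1), and an example counts as correct when the model scores the matching caption higher than the mismatched one for both images. The class and function names are illustrative assumptions, and the numbers are made up for demonstration; they are not results from the paper.

```python
from typing import Iterable, NamedTuple


class WinogroundScores(NamedTuple):
    """Model alignment scores for one example (hypothetical container).

    c0 matches i0 and c1 matches i1; field names read caption_image.
    """
    c0_i0: float
    c0_i1: float
    c1_i0: float
    c1_i1: float


def text_correct(s: WinogroundScores) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return s.c0_i0 > s.c1_i0 and s.c1_i1 > s.c0_i1


def text_score(examples: Iterable[WinogroundScores]) -> float:
    """Percentage of examples on which the text criterion holds."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    return 100.0 * sum(text_correct(s) for s in examples) / len(examples)


# Made-up scores for four examples; the first and last satisfy the criterion.
demo = [
    WinogroundScores(c0_i0=0.9, c0_i1=0.2, c1_i0=0.4, c1_i1=0.8),
    WinogroundScores(c0_i0=0.6, c0_i1=0.7, c1_i0=0.5, c1_i1=0.6),
    WinogroundScores(c0_i0=0.3, c0_i1=0.1, c1_i0=0.5, c1_i1=0.9),
    WinogroundScores(c0_i0=0.7, c0_i1=0.3, c1_i0=0.2, c1_i1=0.4),
]
demo_text_score = text_score(demo)  # 50.0
```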
