Simple Token-Level Confidence Improves Caption Correctness
Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
Abstract
The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a relative improvement in accuracy by 10% on verb understanding in SVO-Probes and outperforms prior state-of-the-art in image and group scores for compositional reasoning in Winoground by a relative 37% and 9%, respectively. When training data are available, a learned confidence estimator provides further improved performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state-of-the-art.
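The core idea of aggregating token confidences into a caption-level score can be sketched as follows. This is an illustrative sketch only, assuming per-token log-probabilities from a captioning model scoring the caption given the image; the function name, the `agg` parameter, and the specific mean/min aggregations are assumptions for illustration, not the paper's exact TLC-A definition.

```python
import math

def tlc_algebraic(token_logprobs, agg="mean"):
    """Aggregate per-token confidences into one caption-level score.

    token_logprobs: one log-probability per caption token, as produced
    by a captioning model conditioned on the image. An algebraic
    confidence measure (here: mean or min of token probabilities) maps
    these to a single image-caption consistency estimate.
    """
    confidences = [math.exp(lp) for lp in token_logprobs]
    if agg == "mean":
        return sum(confidences) / len(confidences)
    if agg == "min":
        return min(confidences)
    raise ValueError(f"unknown aggregation: {agg}")

# A caption with one very unlikely token scores lower than a caption
# whose tokens are all plausible, even if their totals look similar.
consistent = tlc_algebraic([-0.1, -0.2, -0.05], agg="min")
inconsistent = tlc_algebraic([-0.1, -2.5, -0.05], agg="min")
assert consistent > inconsistent
```

A min-style aggregation makes a single implausible token (e.g., a hallucinated object word) dominate the score, which is one intuition for why token-level confidences can catch fine-grained errors that a sequence-level score averages away.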
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Winoground | OFA large (ITM) | Text Score | 30.75 | — | Unverified |
| Winoground | OFA large (TLC-A) | Text Score | 29.25 | — | Unverified |
| Winoground | OFA base (ITM) | Text Score | 26.75 | — | Unverified |
| Winoground | OFA base (TLC-A) | Text Score | 24.50 | — | Unverified |
| Winoground | OFA tiny (ITM) | Text Score | 22.75 | — | Unverified |
| Winoground | OFA tiny (TLC-A) | Text Score | 16.50 | — | Unverified |