ClipCap: CLIP Prefix for Image Captioning
Ron Mokady, Amir Hertz, Amit H. Bermano
Code
- github.com/rmokady/clip_prefix_caption (official, in paper; PyTorch, ★ 1,414)
- github.com/Japanese-Image-Captioning/ClipCap-for-Japanese (PyTorch, ★ 12)
- github.com/sithu31296/image-captioning (PyTorch, ★ 9)
- github.com/MS-P3/code7/tree/main/x_clip (MindSpore, ★ 0)
Abstract
Image captioning is a fundamental task in vision-language understanding, in which a model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to this task. We use the CLIP encoding as a prefix to the caption, employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features trained with textual context, making it well suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach requires only relatively quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.
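The core idea above — mapping a CLIP image embedding to a sequence of prefix embeddings that a frozen GPT-2 then continues into a caption — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions (512-d CLIP ViT-B/32 embedding, 768-d GPT-2 embeddings, prefix length 10) match common choices, but the weights here are random stand-ins and the hidden size is assumed.

```python
import numpy as np

# Assumed dimensions: CLIP ViT-B/32 embedding (512-d) mapped to a prefix
# of k=10 pseudo-token embeddings in GPT-2's 768-d embedding space.
CLIP_DIM, GPT2_DIM, PREFIX_LEN, HIDDEN = 512, 768, 10, 2048

rng = np.random.default_rng(0)

# A minimal two-layer MLP mapping network (the paper's lighter variant);
# in training, only these weights need to be learned while CLIP and the
# language model stay frozen. Random initialization stands in for training.
W1 = rng.standard_normal((CLIP_DIM, HIDDEN)) * 0.02
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, PREFIX_LEN * GPT2_DIM)) * 0.02
b2 = np.zeros(PREFIX_LEN * GPT2_DIM)

def clip_to_prefix(clip_embedding: np.ndarray) -> np.ndarray:
    """Map one CLIP image embedding to PREFIX_LEN prefix embeddings."""
    h = np.tanh(clip_embedding @ W1 + b1)
    return (h @ W2 + b2).reshape(PREFIX_LEN, GPT2_DIM)

clip_vec = rng.standard_normal(CLIP_DIM)   # stand-in for a CLIP encoding
prefix = clip_to_prefix(clip_vec)
print(prefix.shape)                        # prefix prepended to caption tokens
```

At inference, these prefix embeddings would be concatenated in front of the caption's token embeddings and fed to the language model, which generates the caption autoregressively.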
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| COCO Captions | ClipCap (Transformer) | BLEU-4 | 33.53 | — | Unverified |
| COCO Captions | ClipCap (MLP + GPT2 tuning) | BLEU-4 | 32.15 | — | Unverified |
| Conceptual Captions | ClipCap (MLP + GPT2 tuning) | CIDEr | 87.26 | — | Unverified |
| Conceptual Captions | ClipCap (Transformer) | CIDEr | 71.82 | — | Unverified |
| nocaps entire | ClipCap (Transformer) | CIDEr | 65.83 | — | Unverified |
| nocaps entire | ClipCap (MLP + GPT2 tuning) | CIDEr | 65.7 | — | Unverified |
| nocaps in-domain | ClipCap (MLP + GPT2 tuning) | CIDEr | 79.73 | — | Unverified |
| nocaps in-domain | ClipCap (Transformer) | CIDEr | 84.85 | — | Unverified |
| nocaps near-domain | ClipCap (MLP + GPT2 tuning) | CIDEr | 67.69 | — | Unverified |
| nocaps near-domain | ClipCap (Transformer) | CIDEr | 66.82 | — | Unverified |
| nocaps out-of-domain | ClipCap (MLP + GPT2 tuning) | CIDEr | 49.35 | — | Unverified |
| nocaps out-of-domain | ClipCap (Transformer) | CIDEr | 49.14 | — | Unverified |