Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning

2019-07-01 · ACL 2019 · Code Available

Zhihao Fan, Zhongyu Wei, Siyuan Wang, Xuanjing Huang

Abstract

Image captioning aims to generate a short description of an image. Existing research usually employs a CNN-RNN architecture that treats generation as a sequential decision-making process and uses the entire dataset vocabulary as the decoding space. Such models tend to generate high-frequency n-grams containing words that are irrelevant to the image. To tackle this problem, we propose constructing an image-grounded vocabulary that constrains and guides caption generation. Specifically, we propose a novel hierarchical structure that constructs the vocabulary by incorporating both visual information and relations among words. For generation, we propose a word-aware RNN cell that incorporates vocabulary information directly into the decoding process. The REINFORCE algorithm is employed to train the generator, using the constrained vocabulary as the action space. Experimental results on MS COCO and Flickr30k show the effectiveness of our framework compared with several state-of-the-art models.
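The abstract describes decoding and REINFORCE training restricted to an image-grounded vocabulary. The sketch below illustrates one plausible reading of that idea for a single decoding step. It assumes that the constructed vocabulary is available as a set of allowed word indices, and that restricting the action space means masking every other word out of the softmax. The function names (`masked_softmax`, `sample_word`, `reinforce_logit_grad`), the toy vocabulary, the logits, and the scalar reward baseline are all hypothetical and are not taken from the paper. This is not the authors' implementation.

```python
import math
import random
from typing import List, Set


def masked_softmax(logits: List[float], allowed: Set[int]) -> List[float]:
    """Softmax over `allowed` word indices only; every other word gets probability 0."""
    if not allowed:
        raise ValueError("the image-grounded vocabulary must be non-empty")
    # Subtract the largest allowed logit for numerical stability.
    m = max(logits[i] for i in allowed)
    exps = [math.exp(z - m) if i in allowed else 0.0 for i, z in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_word(probs: List[float], rng: random.Random) -> int:
    """Draw one word index; words with probability 0 can never be chosen."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_logit_grad(
    logits: List[float], allowed: Set[int], action: int, reward: float, baseline: float
) -> List[float]:
    """Gradient of the REINFORCE loss -(reward - baseline) * log p(action)
    with respect to the logits, where p is the masked softmax above.
    Words outside `allowed` receive exactly zero gradient."""
    if action not in allowed:
        raise ValueError("the action must lie in the constrained action space")
    probs = masked_softmax(logits, allowed)
    advantage = reward - baseline
    # d/dz_j [-A * log p_a] = A * (p_j - 1[j == a])
    return [advantage * (p - (1.0 if j == action else 0.0)) for j, p in enumerate(probs)]


# Toy example: hypothetical vocabulary and decoder logits for one decoding step.
vocab = ["a", "dog", "cat", "frisbee", "the", "pizza"]
logits = [1.0, 2.0, 0.5, 1.5, 0.8, 3.0]
image_vocab = {0, 1, 3, 4}  # hypothetical image-grounded subset: a, dog, frisbee, the
probs = masked_softmax(logits, image_vocab)
word = sample_word(probs, random.Random(0))
grad = reinforce_logit_grad(logits, image_vocab, word, reward=0.8, baseline=0.5)
print(vocab[word], [round(p, 3) for p in probs])
```

Although "pizza" has the largest raw logit in this toy step, it receives zero probability and zero gradient because it lies outside the hypothetical image-grounded subset. Masking is only one way to realize "using the constrained vocabulary as the action space". The paper's hierarchical vocabulary construction and its word-aware RNN cell, which would produce the per-step logits, are not modeled here.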
