End-to-end Image Captioning Exploits Distributional Similarity in Multimodal Space

2018-11-01 · WS 2018

Pranava Swaroop Madhyastha, Josiah Wang, Lucia Specia


Abstract

We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space: they map a test image to similar training images in this space and generate a caption from the same space. To validate our hypothesis, we focus on the 'image' side of image captioning, varying the input image representation while keeping the RNN text-generation component of a CNN-RNN model constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) suffer virtually no performance loss when a high-dimensional representation is compressed to a lower-dimensional space; and (iii) cluster images with similar visual and linguistic information together. Our experiments all point to one fact: our distributional similarity hypothesis holds. We conclude that, regardless of the image representation, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.
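Finding (ii) above concerns compressing a high-dimensional image representation before it is fed to the caption generator. A minimal sketch of that kind of compression, using plain PCA via NumPy's SVD on randomly generated stand-in features (the dimensions 2048 and 128 are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

# Stand-in for CNN image features (e.g. a 2048-d pooled ResNet vector per
# image). Real experiments would use features extracted from a CNN; random
# data is used here only to show the shapes of the transformation.
rng = np.random.default_rng(0)
features = rng.standard_normal((500, 2048))  # 500 images, 2048-d each

# PCA via SVD: centre the data, then project onto the top-k principal
# components. The compressed vectors would replace the original features
# as input to the caption decoder.
mean = features.mean(axis=0)
centered = features - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
k = 128
compressed = centered @ vt[:k].T  # shape: (500, 128)

print(compressed.shape)
```

The hypothesis predicts that captions generated from `compressed` should be of similar quality to those generated from `features`, since the decoder relies on neighbourhood structure in the joint space rather than on the raw dimensionality of the input.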
