Predicting Visual Futures with Image Captioning and Pre-Trained Language Models

2021-06-16 · ACL ARR Jun 2021

Anonymous

Abstract

The task of visual forecasting deals with predicting future events from a sequence of input images. Purely pixel-based approaches find this challenging due to the presence of abstract concepts and temporal events at different timescales. In this paper, we present an approach that combines image captioning with pre-trained language models to predict visual futures. By leveraging language as an intermediate medium, our model is able to perform more effective temporal reasoning on two different tasks: visual story cloze and action forecasting. Despite making the final predictions using only the generated captions, our approach outperforms state-of-the-art systems by 4% and 6% respectively on the two tasks. We find that our model consistently picks images/actions that are semantically relevant to the given image sequence instead of simply relying on visual similarity.
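The caption-then-rank pipeline the abstract describes could be sketched as below. All function names are illustrative assumptions, and the image captioner and language-model scorer are replaced with toy stand-ins (a hard-coded caption lookup and a token-overlap heuristic); the paper itself uses real captioning and pre-trained language models.

```python
# Hedged sketch of a caption-then-rank pipeline for visual story cloze.
# The captioner and scorer below are placeholders, not the paper's models.

def caption_image(image_id: str) -> str:
    # Placeholder: a real system would run an image-captioning model here.
    captions = {
        "img1": "a child opens a birthday present",
        "img2": "the child smiles at a toy robot",
        "cand_a": "the child plays with the robot",
        "cand_b": "a man repairs a car engine",
    }
    return captions[image_id]

def lm_score(context: str, continuation: str) -> float:
    # Stand-in for a language model's likelihood of the continuation
    # given the context; here a crude token-overlap heuristic.
    ctx_tokens = set(context.split())
    cont_tokens = continuation.split()
    return sum(tok in ctx_tokens for tok in cont_tokens) / len(cont_tokens)

def pick_future(sequence: list[str], candidates: list[str]) -> str:
    # Caption the observed images, then rank candidate futures
    # entirely in the language domain, as the paper proposes.
    context = " ".join(caption_image(img) for img in sequence)
    return max(candidates, key=lambda c: lm_score(context, caption_image(c)))

print(pick_future(["img1", "img2"], ["cand_a", "cand_b"]))  # → cand_a
```

The key design point the abstract highlights is that the final choice is made over generated captions alone, so the ranking step never sees pixels.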
