Visual Goal-Step Inference using wikiHow

2021-04-12 · EMNLP 2021 · Code Available

Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch

Code

github.com/yueyang1996/wikihow-vgsi
Official · In paper

Abstract

Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100M, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
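
The four-way choice described in the abstract can be illustrated with a minimal sketch. It assumes the goal text and each candidate image have already been embedded into a shared vector space, and that candidates are ranked by cosine similarity. The abstract does not specify the scoring method, so both are illustrative assumptions. The names `cosine`, `choose_step`, and `accuracy` are hypothetical and are not taken from the authors' code.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def choose_step(goal_vec: Vector, image_vecs: Sequence[Vector]) -> int:
    """Return the index of the candidate image most similar to the goal.

    Assumption: similarity in a shared embedding space stands in for
    "plausible step towards the goal"; a real model would learn this.
    """
    if len(image_vecs) != 4:
        raise ValueError("VGSI is a four-way choice")
    scores = [cosine(goal_vec, v) for v in image_vecs]
    return max(range(4), key=scores.__getitem__)


def accuracy(examples: Sequence[Tuple[Vector, Sequence[Vector], int]]) -> float:
    """Fraction of (goal, four images, correct index) examples answered correctly."""
    correct = sum(choose_step(g, imgs) == label for g, imgs, label in examples)
    return correct / len(examples)


# Toy, hand-made embeddings (not real data).
goal = [1.0, 0.0]
images = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.1, 0.9]]
print(choose_step(goal, images))  # prints 1
```

With four candidates per goal, random guessing gives an expected accuracy of 25%, which is the natural reference point for accuracy figures on this task.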

Tasks

Multimodal Reasoning, VGSI
