Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions
Peter Jansen
Code: github.com/cognitiveailab/alfred-gpt2
Abstract
The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as "put a hot piece of bread on a plate". Currently, the best-performing models are able to complete less than 1% of these tasks successfully. In this work we focus on modeling the translation problem of converting natural language directives into detailed multi-step sequences of actions that accomplish those goals in the virtual environment. We empirically demonstrate that it is possible to generate gold multi-step plans from language directives alone, without any visual input, in 26% of unseen cases. When a small amount of visual information is incorporated (the starting location in the virtual environment), our best-performing GPT-2 model successfully generates gold command sequences in 58% of cases, suggesting that contextualized language models may provide strong planning modules for grounded virtual agents.