Extended Abstract: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

2020-06-12 · ICML Workshop LaReL 2020

Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

Abstract

Following a navigation instruction such as 'Walk down the stairs and stop near the sofa' requires an agent to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question -- can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions (Sharma et al., 2018)) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer model that scores the compatibility between an instruction ('...stop near the sofa') and a sequence of panoramic images. We demonstrate that pretraining VLN-BERT on image-text pairs from the web significantly improves performance on VLN -- outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful -- with their combination resulting in further synergistic effects.
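
One way to picture how a compatibility score could be used is as a ranking over candidate paths: score each candidate sequence of panoramas against the instruction and keep the best one. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `toy_compatibility` and `select_path` are hypothetical, each panorama is reduced to a list of object labels, and a simple word-overlap count stands in for the learned transformer score.

```python
from typing import Callable, List, Sequence

Panorama = Sequence[str]   # assumption: a panorama is reduced to object labels
Path = Sequence[Panorama]  # a path is an ordered sequence of panoramas


def toy_compatibility(instruction: str, path: Path) -> float:
    """Hypothetical stand-in for a learned scorer.

    Counts distinct instruction words that match an object label seen
    anywhere along the path. A real model would score text tokens against
    image features; this toy only illustrates the interface.
    """
    words = {w.strip(".,'\"").lower() for w in instruction.split()}
    seen = {label.lower() for panorama in path for label in panorama}
    return float(len(words & seen))


def select_path(
    instruction: str,
    candidates: List[Path],
    score: Callable[[str, Path], float] = toy_compatibility,
) -> int:
    """Return the index of the highest-scoring candidate path."""
    scores = [score(instruction, p) for p in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Two made-up candidate paths for the instruction quoted in the abstract.
candidates = [
    [["hallway"], ["kitchen", "table"]],
    [["hallway"], ["stairs"], ["sofa", "lamp"]],
]
instruction = "Walk down the stairs and stop near the sofa"
best = select_path(instruction, candidates)
```

Here the second candidate wins because it passes both 'stairs' and 'sofa'. Swapping `toy_compatibility` for a trained model would leave `select_path` unchanged, which is the point of keeping the scorer behind a plain function argument.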
