Zero-Shot Visual Grounding of Referring Utterances in Dialogue
Anonymous
Abstract
This work explores whether current pretrained multimodal models, which are optimized to align images and captions, can be applied to the rather different domain of referring expressions. In particular, we test whether one such model, CLIP, is effective in capturing two main trends observed for referential chains uttered within a multimodal dialogue, i.e., that utterances become less descriptive over time while their discriminativeness remains unchanged. We show that CLIP captures both, which opens up the possibility of using these models for reference resolution and generation. Moreover, our analysis indicates a possible role for these architectures in uncovering the mechanisms employed by humans when referring to visual entities.
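The core measurement suggested by the abstract can be sketched as scoring an utterance against the target image and a set of distractor images via CLIP-style cosine similarity, and taking the gap as a proxy for discriminativeness. The sketch below is illustrative only: it simulates the image and text embeddings with random vectors instead of CLIP's actual encoders, and the scoring function is a hypothetical stand-in for whatever the paper's pipeline computes.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 512  # CLIP ViT-B/32 embedding dimensionality

# Simulated embeddings standing in for CLIP encoder outputs:
# one target image and five distractor images.
target = rng.normal(size=dim)
distractors = rng.normal(size=(5, dim))

# A referring utterance, simulated as a noisy copy of the target
# embedding (i.e., an utterance that describes the target well).
utterance = target + 0.5 * rng.normal(size=dim)

# Discriminativeness proxy: similarity to the target minus the
# best similarity to any distractor. Positive means the utterance
# singles out the target image.
target_sim = cosine(utterance, target)
best_distractor_sim = max(cosine(utterance, d) for d in distractors)
discriminativeness = target_sim - best_distractor_sim
```

Under the paper's hypothesis, this gap would stay roughly stable across a referential chain even as later utterances become shorter and less descriptive; replacing the simulated vectors with real CLIP image and text embeddings would make the sketch a usable probe.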