Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer

2022-05-01 · Findings (ACL) 2022 · Code Available

Nikolai Ilinykh, Simon Dobnik


Code

github.com/gu-clasp/attention-as-grounding
Official implementation, linked in the paper

Abstract

We explore how a multi-modal transformer trained for generation of longer image descriptions learns syntactic and semantic representations of entities and relations grounded in objects at the level of masked self-attention (text generation) and cross-modal attention (information fusion). We observe that cross-attention learns the visual grounding of noun phrases into objects and high-level semantic information about spatial relations, while text-to-text attention captures low-level syntactic knowledge between words. We conclude that language models in a multi-modal task learn different semantic information about objects and relations cross-modally and uni-modally (text-only). Our code is available here: https://github.com/GU-CLASP/attention-as-grounding.
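As a rough illustration of the two attention types contrasted in the abstract, the sketch below computes scaled dot-product attention weights in plain Python, once with a causal mask (text tokens attending only to earlier text tokens) and once without a mask (text tokens attending to object features). It is not the authors' implementation: the function names, the toy vectors, and the single-head, projection-free setup are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_weights(queries, keys, causal=False):
    """Scaled dot-product attention weights, one row per query.

    With causal=True, query i may only attend to keys 0..i (masked
    self-attention over text). Otherwise every key is visible
    (cross-modal attention from text tokens to object features).
    """
    scale = math.sqrt(len(keys[0]))
    rows = []
    for i, q in enumerate(queries):
        visible = keys[: i + 1] if causal else keys
        weights = softmax([dot(q, k) / scale for k in visible])
        # Masked (future) positions receive zero weight.
        rows.append(weights + [0.0] * (len(keys) - len(visible)))
    return rows

# Hypothetical toy vectors: three text tokens and two detected objects.
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_feats = [[1.0, 0.0], [0.0, 1.0]]

self_attn = attention_weights(text_tokens, text_tokens, causal=True)
cross_attn = attention_weights(text_tokens, object_feats)
```

Each row of `self_attn` or `cross_attn` is a probability distribution over the positions a token attends to. Inspecting such rows per head and per layer is the general kind of analysis the abstract describes, though the exact procedure used in the paper is not specified on this page.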

Tasks

Text Generation, Visual Grounding
