Multimodal Learning: Are Captions All You Need?

2021-11-16ACL ARR November 2021Unverified0· sign in to hype

Anonymous

Unverified — Be the first to reproduce this paper.

Abstract

In today's digital world, it is increasingly common for information to be multimodal: images or videos often accompany text. Sophisticated multimodal architectures such as ViLBERT, VisualBERT, and LXMERT have achieved state-of-the-art performance in vision-and-language tasks. However, existing vision models cannot represent contextual information and semantics like transformer-based language models can. Fusing the semantic-rich information coming from text becomes a challenge. In this work, we study the alternative of first transforming images into text using image captioning. We then use transformer-based methods to combine the two modalities in a simple but effective way. We perform an empirical analysis on different multimodal tasks, describing the benefits, limitations, and situations where this simple approach can replace large and expensive handcrafted multimodal models.

Tasks

All Image Captioning

Multimodal Learning: Are Captions All You Need?

Abstract

Tasks

Reproductions