
Towards Multi-Modal Text-Image Retrieval to improve Human Reading

2021-06-01 · NAACL 2021

Florian Schneider, Özge Alaçam, Xintong Wang, Chris Biemann


Abstract

In primary school children's books, as well as in modern language learning apps, multi-modal learning strategies such as illustrations of terms and phrases are used to support reading comprehension. Several studies in educational psychology also suggest that integrating cross-modal information improves reading comprehension. We claim that state-of-the-art multi-modal transformers, which could be used in a language-learner context to improve human reading, will perform poorly because of the short and relatively simple textual data those models are trained on. To prove our hypotheses, we collected a new multi-modal image-retrieval dataset based on data from Wikipedia. In an in-depth data analysis, we highlight the differences between our dataset and other popular datasets. Additionally, we evaluate several state-of-the-art multi-modal transformers on text-image retrieval on our dataset and analyze their meager results, which verify our claims.
