Scene Text Visual Question Answering

2019-05-31 · ICCV 2019

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, Dimosthenis Karatzas

Code

github.com/rubenpt91/MP-DocVQA-Framework (PyTorch, ★ 69)
github.com/rubenpt91/pfl-docvqa-competition (PyTorch, ★ 21)
github.com/shailzajolly/icdar_vqa (PyTorch, ★ 0)
github.com/Gunnika/Visual-Question-Answering (PyTorch, ★ 0)

Abstract

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
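
The abstract does not define the evaluation metric, so the following is only a minimal sketch of one way a metric can give partial credit for answers that are nearly right but contain text recognition errors. It assumes a normalized Levenshtein similarity with a cutoff of 0.5, in the spirit of the ANLS score associated with ST-VQA. The function names (`levenshtein`, `answer_score`, `mean_score`), the lowercasing and whitespace stripping, and the threshold default are illustrative assumptions, not taken from this page or from the official evaluation code.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def answer_score(prediction: str, ground_truths: Sequence[str],
                 threshold: float = 0.5) -> float:
    """Best similarity of a prediction against any accepted answer.

    Assumption: a normalized distance at or above `threshold` is treated
    as a reasoning error and scores 0; smaller distances (likely OCR
    slips) earn partial credit of 1 - normalized distance.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for gt in ground_truths:
        ref = gt.strip().lower()
        longest = max(len(pred), len(ref))
        nl = levenshtein(pred, ref) / longest if longest else 0.0
        if nl < threshold:
            best = max(best, 1.0 - nl)
    return best


def mean_score(predictions: Sequence[str],
               references: Sequence[Sequence[str]]) -> float:
    """Average per-question score over a set of questions."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not predictions:
        return 0.0
    total = sum(answer_score(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)
```

Under these assumptions, a prediction such as "c0ca cola" against the accepted answer "coca cola" keeps most of its credit, because only one of nine characters differs, while an unrelated answer such as "pepsi" scores zero.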

Tasks

Question Answering, Visual Question Answering, Visual Question Answering (VQA)
