ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities
Anonymous
Abstract
Whether to retrieve, answer, translate or reason, multimodality opens up new challenges and perspectives. In this context, we are interested in Knowledge-based Visual Question Answering about named Entities (KVQAE). To benchmark the task, we provide ViQuAE, a dataset of 3.7K questions paired with images. This is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, products). The dataset is annotated using a semi-automatic method that could be extended to larger data scales. We also propose a Knowledge Base (KB) based on Wikipedia composed of 1.5M articles paired with images. To set a baseline on the benchmark, we address KVQAE as a two-stage problem: Information Retrieval (IR) and Reading Comprehension (RC). IR is carried out with a combination of face recognition, image retrieval and text retrieval while RC is purely text-based. The experiments empirically demonstrate the difficulty of the task. This work paves the way towards better multimodal entity representations and question answering. The dataset, KB and code will be available at https://github.com/Anonymous/ViQuAE.
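The two-stage baseline described above (Information Retrieval over the KB followed by text-only Reading Comprehension) can be sketched as follows. This is a toy illustration only, not the paper's implementation: the `Article` fields, the score-fusion weight `alpha`, and the word-overlap "reader" are all illustrative stand-ins for the actual face recognition, image/text retrieval, and trained RC models.

```python
# Hypothetical sketch of a two-stage KVQAE pipeline:
# (1) rank KB articles by fusing image- and text-based scores;
# (2) run a purely text-based reader on the top-ranked article.
# All names, weights, and scoring functions below are illustrative.

from dataclasses import dataclass

@dataclass
class Article:
    title: str
    text: str
    image_score: float  # assumed precomputed face/image similarity to the query image
    text_score: float   # assumed precomputed text similarity to the question

def retrieve(articles, alpha=0.5):
    """Stage 1 (IR): rank articles by a weighted fusion of image and text scores."""
    return sorted(
        articles,
        key=lambda a: alpha * a.image_score + (1 - alpha) * a.text_score,
        reverse=True,
    )

def read(question, article):
    """Stage 2 (RC): toy extractive reader -- return the article sentence
    sharing the most words with the question (stand-in for a trained model)."""
    q_words = set(question.lower().split())
    sentences = [s.strip() for s in article.text.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))

# Tiny mock KB of Wikipedia-style articles with precomputed similarity scores.
kb = [
    Article("Eiffel Tower",
            "The Eiffel Tower is in Paris. It was completed in 1889.",
            image_score=0.9, text_score=0.7),
    Article("Big Ben",
            "Big Ben is in London. It was completed in 1859.",
            image_score=0.2, text_score=0.4),
]

ranked = retrieve(kb)
answer = read("When was this tower completed", ranked[0])
print(ranked[0].title)  # -> Eiffel Tower
print(answer)           # -> It was completed in 1889
```

The fusion weight `alpha` trades off visual evidence (which identifies the entity in the image) against textual evidence (which matches the question), mirroring the combination of face recognition, image retrieval, and text retrieval described in the abstract.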