YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension

2019-11-01 · IJCNLP 2019

Weiying Wang, Yongcheng Wang, Shi-Zhe Chen, Qin Jin

Abstract

Multimodal semantic comprehension, including tasks such as visual question answering and caption generation, has attracted increasing research interest recently. However, due to limited data, fine-grained semantic comprehension, which requires capturing the semantic details of multimodal content, has not been well investigated. In this work, we introduce "YouMakeup", a large-scale multimodal instructional video dataset to support fine-grained semantic comprehension research in a specific domain. YouMakeup contains 2,800 videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of natural language descriptions of instructional steps, grounded in temporal video ranges and spatial facial areas. The annotated steps in a video involve subtle differences in actions, products and regions, which require fine-grained understanding and reasoning both temporally and spatially. To evaluate models' ability for fine-grained comprehension, we further propose two groups of tasks, generation tasks and visual question answering, covering different aspects. We also establish a baseline for step caption generation for future comparison. The dataset will be publicly available at https://github.com/AIM3-RUC/YouMakeup to support research on fine-grained semantic comprehension.

Tasks

Caption Generation, Question Answering, Visual Question Answering (VQA)
