Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering

2022-07-01 · NAACL 2022

Jianguo Mao, Wenbin Jiang, Xiangdong Wang, Zhifan Feng, Yajuan Lyu, Hong Liu, Yong Zhu

Abstract

Existing video question answering (video QA) models lack the capacity for deep video understanding and flexible multistep reasoning. We propose a novel video QA model that performs dynamic multistep reasoning between questions and videos. It builds a semantic representation of the video from a video scene graph composed of the video's semantic elements and the semantic relations among them. It then performs multistep reasoning between the question and video representations to reach a better answer decision, and dynamically integrates the reasoning results. Experiments show a significant advantage of the proposed model over previous methods in accuracy and interpretability. Compared with the existing state-of-the-art model, the proposed model improves by more than 4%/3.1%/2% on three widely used video QA datasets, MSRVTT-QA, MSRVTT multi-choice, and TGIF-QA, respectively, and offers better interpretability by backtracing along the attention mechanisms to the video scene graphs.
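
The abstract names three ingredients: a scene-graph-based video representation, several reasoning steps between question and video, and a dynamic integration of the step results. The following is a minimal illustrative sketch of that general pattern, not the authors' architecture. Everything in it is an assumption made for illustration: the function names (`attend`, `multistep_reason`), the dot-product attention, the additive query update, and the softmax gating over steps are all hypothetical choices, and scene graph nodes are stood in for by toy vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, nodes):
    """One reasoning step: attention-weighted sum of scene graph node vectors."""
    weights = softmax([dot(query, node) for node in nodes])
    context = [
        sum(w * node[i] for w, node in zip(weights, nodes))
        for i in range(len(query))
    ]
    return context, weights


def multistep_reason(question, nodes, steps=3):
    """Run several attention steps, then fuse them with per-step gates.

    Assumption: the query is updated by adding the attended context, and the
    step results are fused with softmax gates scored against the question.
    The paper's actual update and integration rules are not given here.
    """
    query = list(question)
    contexts, trace = [], []
    for _ in range(steps):
        context, weights = attend(query, nodes)
        contexts.append(context)
        trace.append(weights)  # kept so attention can be traced back to nodes
        query = [q + c for q, c in zip(query, context)]
    gates = softmax([dot(question, c) for c in contexts])
    fused = [
        sum(g * c[i] for g, c in zip(gates, contexts))
        for i in range(len(question))
    ]
    return fused, gates, trace


# Toy scene graph: two elements and one relation, as hypothetical 2-d vectors.
labels = ["person", "dog", "person-holds-dog"]
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, gates, trace = multistep_reason([1.0, 0.0], nodes, steps=3)

# Backtrace: the node that received the most attention at each step.
for step, weights in enumerate(trace):
    print(step, labels[weights.index(max(weights))])
```

Keeping the per-step attention weights in `trace` is what makes the backtracing idea concrete in this sketch: each step's weights can be mapped back to named scene graph nodes.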

Tasks

Question Answering, Video Question Answering, Video Understanding
