Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

2022-06-16

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Code

github.com/antoyang/FrozenBiLM (official, in paper; PyTorch; ★ 158)
github.com/klauscc/dam (PyTorch; ★ 14)
github.com/sts-vlcc/sts-vlcc (PyTorch; ★ 1)

Abstract

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. Specifically, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train these modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
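
Step (iii) of the abstract, answering by filling in a masked token, can be illustrated with a small sketch. It is not the FrozenBiLM implementation: the prompt template, the helper names `build_prompt` and `rank_answers`, and the toy logits are all assumptions made for illustration. In a real system the logits at the `[MASK]` position would come from the frozen bidirectional language model conditioned on the video features.

```python
import math


def build_prompt(question, subtitles=""):
    """Wrap a question in a cloze-style prompt (hypothetical template)."""
    q = question.strip().rstrip("?")
    prompt = f"[CLS] Question: {q}? Answer: [MASK]."
    if subtitles:
        prompt += f" Subtitles: {subtitles.strip()}"
    return prompt + " [SEP]"


def rank_answers(mask_logits, candidates):
    """Rank candidate answers by a softmax restricted to the candidates.

    mask_logits: dict mapping a token to the model's logit at the [MASK]
                 position (toy values here, not real model output).
    candidates:  list of single-token answers to choose from.
    """
    scores = [mask_logits.get(c, float("-inf")) for c in candidates]
    top = max(scores)
    if top == float("-inf"):
        raise ValueError("no candidate has a logit")
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = {c: e / total for c, e in zip(candidates, exps)}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    print(build_prompt("What is the man holding?"))
    toy_logits = {"guitar": 2.0, "phone": 0.5, "dog": -1.0}  # assumed values
    for answer, prob in rank_answers(toy_logits, ["guitar", "phone"]):
        print(f"{answer}: {prob:.3f}")
```

Restricting the softmax to a fixed answer list treats the task as classification over candidates. This sketch assumes every answer is a single token and does not handle multi-token answers.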

Tasks

Fill Mask, Language Modeling, Masked Language Modeling, Question Answering, TGIF-Frame, Video Question Answering, Visual Question Answering (VQA), Zero-Shot Learning, Zero-Shot Video Question Answering

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ActivityNet-QA	FrozenBiLM	Accuracy	43.2	—	Unverified
ActivityNet-QA	FrozenBiLM (0-shot)	Accuracy	25.9	—	Unverified
How2QA	FrozenBiLM	Accuracy	86.7	—	Unverified
How2QA	FrozenBiLM (0-shot)	Accuracy	58.4	—	Unverified
iVQA	FrozenBiLM	Accuracy	0.27	—	Unverified
iVQA	FrozenBiLM	Accuracy	39.6	—	Unverified
iVQA	FrozenBiLM (0-shot)	Accuracy	26.8	—	Unverified
MSRVTT-QA	FrozenBiLM	Accuracy	0.47	—	Unverified
MSRVTT-QA	FrozenBiLM	Accuracy	47	—	Unverified
MSRVTT-QA	FrozenBiLM (0-shot)	Accuracy	16.7	—	Unverified
TVQA	FrozenBiLM	Accuracy	82	—	Unverified
