Revisiting the "Video" in Video-Language Understanding

2022-06-03 · CVPR 2022

Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles

Code

github.com/stanfordvl/atp-video-language
PyTorch · 51 stars

Abstract

What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
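
The idea of a baseline "constrained by image-level understanding" can be illustrated with a small sketch. This is not the authors' implementation: it assumes frame and text embeddings are already available from some frozen image-language encoder, and it takes per-frame selector scores as given (the names `single_frame_prediction`, `split_by_single_frame_success`, and the `selector_scores` field are hypothetical). One frame is chosen without any use of frame order, the answer is scored from that frame alone, and examples are then split by whether a single frame was enough.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def single_frame_prediction(
    frame_embs: List[Vector],
    candidate_embs: List[Vector],
    selector_scores: List[float],  # assumed given; a real selector would be learned
) -> Tuple[int, int]:
    """Pick one frame by selector score, then pick the closest text candidate.

    Frame order is never used, so the predicted answer does not change
    when the frames are shuffled together with their scores.
    """
    frame_idx = max(range(len(frame_embs)), key=lambda i: selector_scores[i])
    chosen = frame_embs[frame_idx]
    answer_idx = max(
        range(len(candidate_embs)),
        key=lambda j: cosine(chosen, candidate_embs[j]),
    )
    return frame_idx, answer_idx


def split_by_single_frame_success(examples: List[Dict]) -> Tuple[List[str], List[str]]:
    """Partition example ids into (answerable from one frame, the rest)."""
    solved, remaining = [], []
    for ex in examples:
        _, pred = single_frame_prediction(
            ex["frame_embs"], ex["candidate_embs"], ex["selector_scores"]
        )
        (solved if pred == ex["label"] else remaining).append(ex["id"])
    return solved, remaining
```

Under these assumptions, the `remaining` list is a rough stand-in for the subset the abstract describes as having a higher concentration of temporally challenging data; how the actual subsets are constructed is described in the paper and the linked repository.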

Tasks

Benchmarking, Question Answering, Retrieval, Text to Video Retrieval, Video Question Answering, Video Retrieval

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
How2QA	ATP	Accuracy	65.1	—	Unverified
MSR-VTT-MC	ATP (1<-16)	Accuracy	93.2	—	Unverified
NExT-QA	ATP	Accuracy	54.3	—	Unverified
STAR Benchmark	Temp[ATP]	Average Accuracy	48.37	—	Unverified