Towards Long-Form Video Understanding
Chao-yuan Wu, Philipp Krähenbühl
Code
- github.com/chaoyuaw/lvu (official, referenced in paper; PyTorch, ★ 87)
- github.com/md-mohaiminul/ViS4mer (PyTorch, ★ 58)
Abstract
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.
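To make the idea concrete, here is a minimal sketch of an object-centric transformer in PyTorch (the framework of both repositories above). It assumes per-object features have already been extracted by a short-term backbone and detector; the class name, dimensions, and the single classification-token readout are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of an object-centric transformer for long-form video.
# Assumes per-object features come from a pretrained short-term model plus a
# detector; all names and sizes here are illustrative, not from the paper.
import torch
import torch.nn as nn

class ObjectTransformer(nn.Module):
    def __init__(self, feat_dim=2048, d_model=768, n_heads=12,
                 n_layers=3, n_classes=80):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)          # project object features
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, obj_feats):
        # obj_feats: (batch, num_object_tokens, feat_dim), one token per
        # detected object instance across the whole video. Temporal/positional
        # embeddings are omitted here for brevity.
        x = self.proj(obj_feats)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                    # prepend [CLS] token
        x = self.encoder(x)
        return self.head(x[:, 0])                         # classify from [CLS]

# Usage: classify a long video summarized as 128 object tokens.
model = ObjectTransformer()
logits = model(torch.randn(2, 128, 2048))                 # shape (2, 80)
```

Tokenizing detected object instances rather than raw frames keeps the sequence length manageable over multi-minute videos, which is the design intuition the abstract points to.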
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AVA v2.2 | Object Transformer | mAP (%) | 31.0 | — | Unverified |