VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2021-09-28 · EMNLP 2021 · Code available

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer


Code

github.com/pytorch/fairseq
Official · In paper

Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
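The contrastive objective described in the abstract can be illustrated with a generic symmetric InfoNCE-style loss. This is only a minimal sketch under stated assumptions: the abstract does not spell out the exact loss, so the dot-product similarity, the `temperature` parameter, and the function names below are illustrative choices, not the authors' implementation. The paper's sampling of temporally overlapping positives and its retrieval of hard negatives by nearest neighbor are not reproduced here; the sketch simply treats every other pairing in the batch as a negative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_softmax_at(scores, index):
    """Log-probability of scores[index] under a softmax over scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_z


def symmetric_info_nce(video_emb, text_emb, temperature=1.0):
    """Hypothetical helper: video_emb[i] and text_emb[i] form a positive pair;
    every other pairing in the batch is treated as a negative (assumption)."""
    n = len(video_emb)
    # sim[i][j] = similarity between video i and text j
    sim = [[dot(v, t) / temperature for t in text_emb] for v in video_emb]
    # video-to-text: each video should pick out its own text among all texts
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # text-to-video: each text should pick out its own video among all videos
    t2v = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return v2t + t2v


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_info_nce(video, aligned_text))
    print(symmetric_info_nce(video, swapped_text))
```

Correctly aligned pairs yield a lower loss than mismatched ones. In a setup like the one the abstract describes, harder negatives would make the denominator of each softmax more competitive, which is the motivation for retrieving them rather than relying on random batch items.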

Tasks

Action Localization, Action Segmentation, Long Video Retrieval (Background Removed), Retrieval, Temporal Action Localization, Temporal Relation Extraction, Video Retrieval, Zero-Shot Video Retrieval
