
End-to-End Learning of Visual Representations from Uncurated Instructional Videos

2019-12-13 · CVPR 2020 · Code Available

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

Abstract

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches on these tasks, as well as several fully supervised baselines.

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| RareAct | HT100M S3D | mWAP | 30.5 | | Unverified |

Reproductions

No reproductions yet. Be the first to reproduce this paper.