Convolutional Two-Stream Network Fusion for Video Action Recognition
Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
Code
- github.com/feichtenhofer/twostreamfusion (official, referenced in paper)
- github.com/tomar840/two-stream-fusion-for-action-recognition-in-videos (PyTorch)
Abstract
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
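The core idea in finding (i) — fusing the spatial and temporal towers at a convolution layer rather than at the softmax — amounts to stacking the two towers' feature maps along the channel axis and applying a learned 1×1 convolution. The sketch below illustrates this with NumPy on toy-sized maps; the array shapes, helper names, and identity-style initialization (which makes the fusion start out as channel-wise summation) are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature maps from the two towers at a conv layer, shape (C, H, W).
# Real VGG-16 conv5 maps would be (512, 14, 14); small sizes keep this readable.
C, H, W = 4, 3, 3
x_spatial = rng.standard_normal((C, H, W))    # appearance tower
x_temporal = rng.standard_normal((C, H, W))   # motion tower

def conv_fusion(xa, xb, weights, bias):
    """Concatenate along channels, then fuse with a 1x1 convolution.

    A 1x1 convolution is a per-pixel linear map over channels, so it can
    be written as a single matrix product. `weights` has shape (C_out, 2C).
    """
    stacked = np.concatenate([xa, xb], axis=0)        # (2C, H, W)
    flat = stacked.reshape(2 * xa.shape[0], -1)       # (2C, H*W)
    fused = weights @ flat + bias[:, None]            # (C_out, H*W)
    return fused.reshape(weights.shape[0], *xa.shape[1:])

# Assumed initialization: two stacked identity matrices, so the fused map
# begins as the sum of corresponding channels and training can refine it.
W_fuse = np.concatenate([np.eye(C), np.eye(C)], axis=1)  # (C, 2C)
b_fuse = np.zeros(C)

y = conv_fusion(x_spatial, x_temporal, W_fuse, b_fuse)
print(y.shape)  # (4, 3, 3)
```

With this initialization the output equals `x_spatial + x_temporal` exactly; the learned filter can then move beyond plain summation to weight and recombine channels across the two streams, which is what distinguishes conv fusion from fixed sum fusion.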
Tasks
- Action Recognition
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| HMDB-51 | S: VGG-16, T: VGG-16 (ImageNet pretrained) | Mean accuracy over 3 splits (%) | 65.4 | — | Unverified |
| UCF101 | S: VGG-16, T: VGG-16 (ImageNet pretrained) | Mean accuracy over 3 splits (%) | 92.5 | — | Unverified |