Spatiotemporal Residual Networks for Video Action Recognition

2016-11-07 · NeurIPS 2016 · Code Available

Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

Code

github.com/feichtenhofer/st-resnet
Official · In paper · none · ★ 178

Abstract

Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping these with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. This approach slowly increases the spatiotemporal receptive field as the depth of the model increases and naturally integrates image ConvNet design principles. The whole model is trained end-to-end to allow hierarchical learning of complex spatiotemporal features. We evaluate our novel spatiotemporal ResNet using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.
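The two kinds of residual connection described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes features are reduced to a `(T, C)` array (time steps by channels), that the cross-stream connection adds motion features to appearance features, and that "initialized as temporal residual connections" means a temporal filter whose center tap is the identity and whose other taps are zero. The function names `temporal_conv`, `identity_temporal_filter`, and `inject_motion` are hypothetical.

```python
import numpy as np

def temporal_conv(x, w):
    """Zero-padded 'same' convolution over time.

    x: (T, C) feature sequence; w: (K, C, C) temporal filter with K odd.
    Tap k mixes the channels of the frame at offset k - K // 2.
    """
    T, _ = x.shape
    K = w.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # pad only the time axis
    out = np.zeros_like(x)
    for k in range(K):
        out += xp[k:k + T] @ w[k]
    return out

def identity_temporal_filter(channels, taps=3):
    """Assumed initialization: center tap is the identity, neighbors are zero,
    so the layer initially passes its input through unchanged."""
    w = np.zeros((taps, channels, channels))
    w[taps // 2] = np.eye(channels)
    return w

def inject_motion(appearance, motion):
    """Assumed additive cross-stream residual connection (motion into appearance)."""
    return appearance + motion

rng = np.random.default_rng(0)
appearance = rng.standard_normal((8, 4))  # 8 time steps, 4 channels
motion = rng.standard_normal((8, 4))

fused = inject_motion(appearance, motion)
w = identity_temporal_filter(channels=4)
out = temporal_conv(fused, w)

# At initialization the temporal layer leaves the pretrained features intact.
print(np.allclose(out, fused))
```

Under these assumptions the network starts out behaving like the pretrained image ConvNet, and training can then move the neighboring taps away from zero so that each layer draws on adjacent frames.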

Tasks

Action Recognition, Action Recognition In Videos, Temporal Action Localization

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
HMDB-51	ST-ResNet + IDT	Average accuracy of 3 splits	70.3	—	Unverified
UCF101	ST-ResNet + IDT	3-fold Accuracy	94.6	—	Unverified
