STM: SpatioTemporal and Motion Encoding for Action Recognition

2019-08-07 · ICCV 2019

Boyuan Jiang, Mengmeng Wang, Weihao Gan, Wei Wu, Junjie Yan

Abstract

Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and a separate flow stream to learn motion features. In this work, we aim to efficiently encode both kinds of features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that, by encoding spatiotemporal and motion features together, the proposed STM network outperforms state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51).
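
To make the idea of encoding motion without an optical-flow stream more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how the CMM computes motion or how the two modules are combined. The sketch assumes that motion can be approximated by differences between the features of adjacent frames, that the last time step is zero-padded, and that the two feature sets are fused by element-wise summation. The names `temporal_difference` and `fuse` are hypothetical, and the convolutions a real block would contain are omitted.

```python
from typing import List

# Hypothetical representation: one flattened feature vector per frame.
Frame = List[float]


def temporal_difference(features: List[Frame]) -> List[Frame]:
    """Return a motion proxy per frame: features[t + 1] - features[t].

    Assumption (not stated in the abstract): the last time step has no
    successor, so it is filled with zeros to keep the output length equal
    to the input length.
    """
    if not features:
        return []
    width = len(features[0])
    if any(len(frame) != width for frame in features):
        raise ValueError("all frames must have the same feature length")
    motion = [
        [nxt - cur for cur, nxt in zip(features[t], features[t + 1])]
        for t in range(len(features) - 1)
    ]
    motion.append([0.0] * width)  # zero-pad the final time step
    return motion


def fuse(spatiotemporal: List[Frame], motion: List[Frame]) -> List[Frame]:
    """Element-wise sum of the two feature sets (an assumed fusion rule)."""
    return [
        [s + m for s, m in zip(s_frame, m_frame)]
        for s_frame, m_frame in zip(spatiotemporal, motion)
    ]


# Three frames with two features each (made-up illustrative values).
features = [[1.0, 2.0], [1.5, 2.0], [3.0, 1.0]]
motion = temporal_difference(features)
fused = fuse(features, motion)
```

In this toy input the second feature does not change between the first two frames, so its motion proxy there is zero, while the first feature registers a change at both transitions. Working on features rather than on precomputed optical flow is what lets a single 2D network carry both kinds of information.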

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| HMDB-51 | STM (ImageNet+Kinetics pretrain) | Average accuracy of 3 splits | 72.2 | | Unverified |
| Jester (Gesture Recognition) | STM (ResNet-50, 16 frames) | Val | 96.7 | | Unverified |
| Something-Something V1 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy | 50.7 | | Unverified |
| Something-Something V2 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy | 64.2 | | Unverified |
| UCF101 | STM (ImageNet+Kinetics pretrain) | 3-fold Accuracy | 96.2 | | Unverified |
