SOTAVerified

Relational Self-Attention: What's Missing in Attention for Video Understanding

2021-11-02 · NeurIPS 2021 · Code Available

Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho


Abstract

Convolution has arguably been the most important feature transform for modern neural networks, leading to the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structure of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms its convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
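To make the abstract's core idea concrete, here is a heavily simplified, single-query toy sketch of a relational feature transform. This is an assumed reading of the abstract, not the authors' exact RSA formulation: the aggregation kernel over a spatio-temporal neighborhood is *generated* from query-key relation features (via an assumed learned weight vector `w_kernel`), rather than being a fixed dot-product score as in standard self-attention.

```python
import numpy as np

def rsa_sketch(x, p_q, p_k, p_v, w_kernel):
    """Toy relational-kernel transform (illustrative, not the paper's RSA).

    x        : (n, c) features of one flattened spatio-temporal neighborhood,
               with row 0 taken as the query position
    p_q, p_k, p_v : (c, c) linear projections for query / keys / values
    w_kernel : (c,) hypothetical learned weights mapping relation features
               to a dynamic (relational) kernel over the neighborhood
    """
    q = x[0] @ p_q                 # query vector, shape (c,)
    k = x @ p_k                    # neighborhood keys, shape (n, c)
    v = x @ p_v                    # neighborhood values, shape (n, c)
    relation = k * q               # elementwise query-key relation features, (n, c)
    scores = relation @ w_kernel   # dynamically generated kernel values, (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()       # normalize like attention weights
    return weights @ v             # aggregated relational context, shape (c,)

# Usage: a 2x2x2 spatio-temporal neighborhood flattened to n=8, c=4 channels.
rng = np.random.default_rng(0)
n, c = 8, 4
x = rng.standard_normal((n, c))
out = rsa_sketch(x,
                 rng.standard_normal((c, c)),
                 rng.standard_normal((c, c)),
                 rng.standard_normal((c, c)),
                 rng.standard_normal(c))
print(out.shape)  # (4,)
```

One property of this sketch worth noting: with `w_kernel` set to all ones, `relation @ w_kernel` reduces to the plain dot product `k @ q`, i.e., standard softmax self-attention is recovered as a special case; a learned `w_kernel` lets the model reweight individual relation channels.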


Benchmark Results

Dataset | Model | Metric | Claimed | Verified | Status
--- | --- | --- | --- | --- | ---
Diving-48 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Accuracy | 84.2 | — | Unverified
Something-Something V1 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy | 56.1 | — | Unverified
Something-Something V1 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 55.5 | — | Unverified
Something-Something V1 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 54 | — | Unverified
Something-Something V1 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 51.9 | — | Unverified
Something-Something V2 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-1 Accuracy | 67.7 | — | Unverified
Something-Something V2 | RSANet-R50 (8+16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 67.3 | — | Unverified
Something-Something V2 | RSANet-R50 (16 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 66 | — | Unverified
Something-Something V2 | RSANet-R50 (8 frames, ImageNet pretrained, a single clip) | Top-1 Accuracy | 64.8 | — | Unverified
Something-Something V2 | RSANet-R50 (8+16 frames, ImageNet pretrained, 2 clips) | Top-5 Accuracy | 91.1 | — | Unverified

Reproductions