CAST: Cross-Attention in Space and Time for Video Action Recognition

2023-11-30 · NeurIPS 2023 · Code Available

DongHo Lee, Jongseo Lee, Jinwoo Choi

Code

github.com/khu-vll/cast
In paper · PyTorch · ★ 54

Abstract

Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
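
The abstract names a bottleneck cross-attention mechanism but does not spell out its internals, so the following is only a minimal PyTorch sketch of the general idea under stated assumptions, not the authors' implementation. It assumes each expert produces a sequence of token embeddings, that tokens are projected down to a narrower width before cross-attention, and that the result is projected back up and added residually. The class name `BottleneckCrossAttention`, the dimensions, and the token counts are all hypothetical.

```python
import torch
import torch.nn as nn


class BottleneckCrossAttention(nn.Module):
    """Hypothetical sketch: tokens of one expert attend to the other expert's
    tokens inside a reduced-width (bottleneck) space."""

    def __init__(self, dim: int = 64, bottleneck_dim: int = 16, num_heads: int = 4):
        super().__init__()
        # Down-projections into the bottleneck (assumed design choice).
        self.down_q = nn.Linear(dim, bottleneck_dim)
        self.down_kv = nn.Linear(dim, bottleneck_dim)
        # Standard multi-head attention, used here as cross-attention.
        self.attn = nn.MultiheadAttention(bottleneck_dim, num_heads, batch_first=True)
        # Up-projection back to the expert's original width.
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        q = self.down_q(x)         # queries come from this expert
        kv = self.down_kv(other)   # keys and values come from the other expert
        exchanged, _ = self.attn(q, kv, kv)
        return x + self.up(exchanged)  # residual keeps the expert's own features


torch.manual_seed(0)
# Placeholder token sequences: (batch, tokens, dim). Sizes are illustrative only.
spatial_tokens = torch.randn(2, 10, 64)
temporal_tokens = torch.randn(2, 20, 64)

temporal_to_spatial = BottleneckCrossAttention()
spatial_to_temporal = BottleneckCrossAttention()

# Each stream is updated with information from the other stream.
spatial_out = temporal_to_spatial(spatial_tokens, temporal_tokens)
temporal_out = spatial_to_temporal(temporal_tokens, spatial_tokens)
print(spatial_out.shape, temporal_out.shape)
```

Each stream keeps its own token count and width, so a module like this could sit between the layers of two separately pretrained backbones without changing their shapes. The narrow bottleneck keeps the added parameter count small relative to the backbones.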

Tasks

Action Classification, Action Recognition, Action Recognition In Videos, Video Understanding

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
EPIC-KITCHENS-100	CAST(ViT-B/16)	Action@1	49.3	—	Unverified
Something-Something V2	CAST(ViT-B/16)	Top-1 Accuracy	71.6	—	Unverified
