
Learning Correlation Structures for Vision Transformers

2024-04-05 · CVPR 2024 · Unverified

Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

Abstract

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
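The abstract describes StructSA in three steps: form query-key correlation maps, recognize their local (space-time) structure with a convolution, and use the result to aggregate local contexts of value features. The toy sketch below illustrates one possible reading of that pipeline on a 1-D token grid; the shapes, the single fixed kernel, and the neighborhood size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def structsa_sketch(q, k, v, kernel):
    """Illustrative single-head sketch of structural self-attention.

    q, k, v: (N, D) token features on a 1-D grid of N positions
    kernel:  (U,) filter that scans each query's correlation row for
             local structural patterns (hypothetical; the paper uses
             learned space-time convolutions).
    """
    N, D = q.shape
    # 1. Dense query-key correlation map, shape (N, N).
    corr = q @ k.T / np.sqrt(D)
    # 2. Recognize local correlation structure: convolve each query's
    #    correlation row with the kernel to get structure-aware logits.
    pad = len(kernel) // 2
    padded = np.pad(corr, ((0, 0), (pad, pad)))
    logits = np.stack(
        [np.convolve(padded[i], kernel, mode="valid") for i in range(N)]
    )  # (N, N)
    # 3. Softmax over keys to obtain attention weights.
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # 4. Aggregate *local contexts* of value features: each key position
    #    contributes an average over a small neighborhood (radius 1 here).
    local_v = np.stack(
        [v[max(0, j - 1): j + 2].mean(axis=0) for j in range(N)]
    )  # (N, D)
    return attn @ local_v
```

Compared with plain self-attention, the convolution in step 2 lets the attention weight at a position depend on the surrounding correlation pattern (e.g. a motion-like diagonal) rather than on a single similarity score.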

Benchmark Results

Dataset                 | Model            | Metric          | Claimed | Verified | Status
Diving-48               | StructViT-B-4-1  | Accuracy        | 88.3    | —        | Unverified
Something-Something V1  | StructViT-B-4-1  | Top-1 Accuracy  | 61.3    | —        | Unverified
Something-Something V2  | StructViT-B-4-1  | Top-1 Accuracy  | 71.5    | —        | Unverified