
Learning Correlation Structures for Vision Transformers

2024-04-05 · CVPR 2024 · Unverified

Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

Abstract

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
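The abstract describes StructSA in three steps: form query-key correlation maps, recognize their local (space-time) structure with a convolution, and use the result to aggregate local contexts of value features. The toy sketch below illustrates one possible reading of that pipeline on a 1-D token grid; the shapes, the single fixed kernel, and the neighborhood size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def structsa_sketch(q, k, v, kernel):
    """Illustrative single-head sketch of structural self-attention.

    q, k, v: (N, D) token features on a 1-D grid of N positions
    kernel:  (U,) filter that scans each query's correlation row for
             local structural patterns (hypothetical; the paper uses
             learned space-time convolutions).
    """
    N, D = q.shape
    # 1. Dense query-key correlation map, shape (N, N).
    corr = q @ k.T / np.sqrt(D)
    # 2. Recognize local correlation structure: convolve each query's
    #    correlation row with the kernel to get structure-aware logits.
    pad = len(kernel) // 2
    padded = np.pad(corr, ((0, 0), (pad, pad)))
    logits = np.stack(
        [np.convolve(padded[i], kernel, mode="valid") for i in range(N)]
    )  # (N, N)
    # 3. Softmax over keys to obtain attention weights.
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # 4. Aggregate *local contexts* of value features: each key position
    #    contributes an average over a small neighborhood (radius 1 here).
    local_v = np.stack(
        [v[max(0, j - 1): j + 2].mean(axis=0) for j in range(N)]
    )  # (N, D)
    return attn @ local_v
```

Compared with plain self-attention, the convolution in step 2 lets the attention weight at a position depend on the surrounding correlation pattern (e.g. a motion-like diagonal) rather than on a single similarity score.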

Benchmark Results

Dataset                 | Model            | Metric          | Claimed | Verified | Status
Diving-48               | StructViT-B-4-1  | Accuracy        | 88.3    | —        | Unverified
Something-Something V1  | StructViT-B-4-1  | Top-1 Accuracy  | 61.3    | —        | Unverified
Something-Something V2  | StructViT-B-4-1  | Top-1 Accuracy  | 71.5    | —        | Unverified