LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition

2024-11-27 · IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI), 2024

Sandipan Sarma, Divyam Singal, Arijit Sur

Abstract

The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient "zero-shot" scene understanding. Pairing such models with transformers for temporal modeling has proven rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely used benchmarks – UCF101, HMDB51, ActivityNet, and Kinetics – show that we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on UCF101 and HMDB51 in the conventional ZSAR setting, and 16.6% on UCF101 in the generalized setting. For the large-scale datasets ActivityNet and Kinetics, our method achieves relative gains of 31.8% and 27.9%, respectively, over previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51 under the recent "TruZe" evaluation protocol.
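The abstract's two components can be illustrated with a minimal NumPy sketch: a depthwise dilated temporal convolution applied at several dilation rates (the "multi-scale local context" idea), and a single-head graph attention layer over class embeddings. The kernel parameterization, mean fusion across dilation branches, and single-head attention are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def dilated_conv1d(x, w, dilation):
    """Depthwise 1-D convolution over time with 'same' zero padding.
    x: (T, C) frame features, w: (K, C) per-channel kernel."""
    T, C = x.shape
    K = w.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            out[t] += w[k] * xp[t + k * dilation]
    return out

def multi_scale_context(frame_feats, kernels, dilations=(1, 2, 4)):
    """Aggregate local temporal context at several dilation rates.
    Mean fusion over branches is a hypothetical choice for this sketch."""
    branches = [dilated_conv1d(frame_feats, w, d)
                for w, d in zip(kernels, dilations)]
    return np.mean(branches, axis=0)

def gat_layer(h, adj, W, a):
    """One graph-attention layer over class embeddings.
    h: (N, F) node features, adj: (N, N) 0/1 adjacency with self-loops,
    W: (F, Fp) projection, a: (2*Fp,) attention vector."""
    z = h @ W
    N = z.shape[0]
    logits = np.full((N, N), -1e9)          # mask non-neighbors
    for i in range(N):
        for j in range(N):
            if adj[i, j] > 0:
                logits[i, j] = leaky_relu(np.dot(a, np.concatenate([z[i], z[j]])))
    att = np.exp(logits - logits.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)   # softmax over each neighborhood
    return att @ z                          # neighborhood-weighted embeddings
```

Larger dilation rates widen the temporal receptive field without extra parameters, which is what lets a single stack of such layers see both short- and long-range local context around each frame.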
