Snippet-Aware Transformer With Multiple Action Elements for Skeleton-Based Action Segmentation
Haoyu Ji, Bowen Chen, Wenze Huang, Weihong Ren, Zhiyong Wang, Honghai Liu
- Code: github.com/HaoyuJi/ME-ST
Abstract
Skeleton-based temporal action segmentation (STAS) aims to densely segment and classify human actions within lengthy untrimmed skeletal motion sequences. Current methods primarily rely on graph convolutional networks (GCNs) for intraframe spatial modeling and temporal convolutional networks (TCNs) for interframe temporal modeling to discern motion patterns. However, these approaches often overlook the distinctive nature of essential action elements across various actions, including the engaged core body parts and key subactions, which limits their ability to distinguish different actions within a given sequence. To address these limitations, the snippet-aware Transformer with multiple action elements (ME-ST) is proposed to enhance the discrimination and segmentation among actions; it leverages intrasnippet attention along joints and sequences to identify core joints and key subactions at different scales. Specifically, in the spatial domain, the intrasnippet cross-joint attention (CJA) module divides the sequence into distinct snippets and computes attention within each snippet to establish intricate joint semantic relationships, emphasizing the identification of core motion joints. In the temporal domain, the encoder's intrasnippet cross-frame attention (CFA) module segments the sequence in a blockwise expansion manner and establishes interframe relationships to highlight the most discriminative frames. In the decoder, clip-level representations at various temporal scales are first generated through an hourglass-like sampling process, after which the intrasnippet cross-scale attention (CSA) module integrates key clip information across the different time scales. Performance evaluation on five public datasets demonstrates that ME-ST achieves state-of-the-art (SOTA) performance.
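To make the snippet-wise attention idea concrete, the following is a minimal sketch of what an intrasnippet cross-joint attention step could look like: the sequence is split into fixed-length snippets, each snippet's frames are merged into per-joint tokens, and single-head attention is computed across the joints of each snippet. The snippet length, feature sizes, and single-head unbatched formulation are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of intrasnippet cross-joint attention (CJA):
# attention across the J joints within each snippet of frames.
# Shapes and the single-head design are assumptions for illustration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intrasnippet_cross_joint_attention(x, snippet_len, Wq, Wk, Wv):
    """x: (T, J, C) skeleton sequence. Splits T into snippets of
    `snippet_len` frames and attends across joints within each snippet."""
    T, J, C = x.shape
    assert T % snippet_len == 0, "sequence length must divide into snippets"
    # (num_snippets, snippet_len, J, C) -> one token per joint per snippet
    s = x.reshape(T // snippet_len, snippet_len, J, C)
    tokens = s.transpose(0, 2, 1, 3).reshape(T // snippet_len, J, snippet_len * C)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    # scaled dot-product attention over the joint axis of each snippet
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v  # (num_snippets, J, d)

rng = np.random.default_rng(0)
T, J, C, S, d = 8, 5, 4, 4, 16  # toy sizes, not the paper's settings
x = rng.standard_normal((T, J, C))
Wq, Wk, Wv = (rng.standard_normal((S * C, d)) for _ in range(3))
y = intrasnippet_cross_joint_attention(x, S, Wq, Wk, Wv)
print(y.shape)  # (2, 5, 16): 2 snippets, 5 joint tokens, d-dim features
```

Because attention is restricted to joints within a snippet rather than the full sequence, each snippet can emphasize a different set of core joints, which is the intuition the abstract attributes to CJA.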