MAX-AST: COMBINING CONVOLUTION, LOCAL AND GLOBAL SELF-ATTENTIONS FOR AUDIO EVENT CLASSIFICATION

2024-04-14 · ICASSP 2024

Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson

Abstract

In the domain of audio transformer architectures, prior research has extensively investigated isotropic architectures that capture global context through full self-attention, and hierarchical architectures that progressively transition from local to global context using convolutions or window-based attention. However, the idea of imbuing each individual block with both local and global context, thereby creating a hybrid transformer block, remains relatively under-explored. To facilitate this exploration, we introduce the Multi Axis Audio Spectrogram Transformer (Max-AST), an adaptation of MaxViT to the audio domain. Our approach leverages convolution, local window-attention, and global grid-attention in every transformer block. The proposed model excels in efficiency compared to prior methods and consistently outperforms state-of-the-art techniques, achieving significant gains of up to 2.6% on the AudioSet full set. Further, we performed detailed ablations to analyse the impact of each of these components on audio feature learning. The source code is available at https://github.com/ta012/MaxAST.git.
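The key distinction the abstract draws is between local window-attention (tokens attend within a small contiguous patch of the spectrogram) and global grid-attention (each attention group gathers tokens strided across the whole feature map). A minimal NumPy sketch of these two token-grouping schemes, in the style of MaxViT's multi-axis attention, is shown below; the function names are hypothetical and the attention omits learned projections, so this is purely illustrative of the partitioning, not the paper's implementation.

```python
import numpy as np

def window_partition(x, p):
    """Group a (H, W, C) feature map into non-overlapping p x p windows.

    Each group holds spatially contiguous tokens -> local (block) attention.
    Returns shape (num_windows, p*p, C).
    """
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)

def grid_partition(x, g):
    """Group a (H, W, C) feature map into a g x g grid of strided tokens.

    Each group holds tokens sampled uniformly across the whole map
    -> sparse global (grid) attention. Returns shape (num_groups, g*g, C).
    """
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, c)

def self_attention(tokens):
    """Plain softmax self-attention within each group (no learned weights)."""
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ tokens
```

In a MaxViT-style hybrid block, a convolutional stage (an MBConv in MaxViT) is followed by attention over `window_partition` groups and then attention over `grid_partition` groups, so every block mixes local and global context at the same resolution, without the quadratic cost of full self-attention over all tokens.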
