SOTAVerified

Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

2025-02-17 · ICASSP 2025 · Code Available

Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid


Abstract

Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be better suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second one is unsupervised classification, which utilises the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablation studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K, and outperforms the results of comparable supervised methods for musical auto-tagging on Magna-tag-a-tune.
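To make the two jointly solved pretext tasks concrete, the sketch below combines a masked latent prediction loss with a teacher/student distribution-matching loss in the style the abstract describes. This is an illustrative reading, not the paper's implementation: the function name, tensor shapes, temperatures, and equal loss weighting are all assumptions.

```python
import torch
import torch.nn.functional as F

def matpac_style_loss(student_latents, teacher_latents,
                      student_logits, teacher_logits,
                      mask, teacher_temp=0.04, student_temp=0.1):
    """Hedged sketch of a two-pretext-task objective.

    student_latents / teacher_latents: (B, T, D) patch-level latents.
    student_logits / teacher_logits:   (B, T, K) unsupervised-class scores.
    mask: (B, T) boolean, True where the input patch was masked.
    All names and hyperparameters here are illustrative assumptions.
    """
    # Pretext task 1: masked latent prediction — regress the teacher's
    # latents on the masked patches only.
    pred_loss = F.mse_loss(student_latents[mask], teacher_latents[mask])

    # Pretext task 2: unsupervised classification — match the student's
    # class distribution to a sharpened, detached teacher distribution
    # (cross-entropy between the two softmax outputs).
    teacher_probs = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / student_temp, dim=-1)
    cls_loss = -(teacher_probs * student_logp).sum(dim=-1).mean()

    # Equal weighting of the two losses is an assumption for this sketch.
    return pred_loss + cls_loss
```

In such setups the teacher is typically an exponential moving average of the student and receives no gradients, which the `.detach()` on the teacher distribution reflects.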

Tasks

Benchmark Results

Dataset   Model                             Metric           Claimed   Verified   Status
ESC-50    MATPAC (SSL model, linear eval)   Top-1 Accuracy   93.5                 Unverified
FSD50K    MATPAC (SSL model)                mAP              55.2                 Unverified

Reproductions