A vector quantized masked autoencoder for speech emotion recognition

2023-04-21

Samir Sadok, Simon Leglaive, Renaud Séguier


Code

github.com/samsad35/VQ-MAE-S-code
Official, PyTorch, ★ 30

Abstract

Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER.
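
The core idea in the abstract, masking in the discrete latent space rather than on the raw spectrogram, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `quantize` and `mask_tokens`, the hand-written codebook, and the 50% mask ratio are all assumptions made for illustration. In the real model the codebook would be learned by the vector-quantized variational autoencoder and a trained network would predict the masked tokens.

```python
import random


def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of tokens with mask_id.

    Returns the masked sequence and the sorted list of masked positions.
    """
    n_mask = int(round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = mask_id
    return masked, positions


# Toy codebook and "latent frames" (hypothetical values, not from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]

tokens = quantize(frames, codebook)          # discrete indices, one per frame
MASK_ID = len(codebook)                      # an extra index reserved for [MASK]
masked, positions = mask_tokens(tokens, 0.5, MASK_ID, random.Random(0))

# During self-supervised pre-training, a model would see `masked` and be
# trained to predict `targets`, the original indices at the masked positions.
targets = [tokens[p] for p in positions]
```

Because the prediction targets are discrete indices, the pre-training objective in such a setup can be treated as classification over codebook entries instead of regression on spectrogram values.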

Tasks

Emotion Recognition Self-Supervised Learning Speech Emotion Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
EmoDB Dataset	VQ-MAE-S-12 (Frame) + Query2Emo	Accuracy	90.2	—	Unverified
RAVDESS	VQ-MAE-S-12 (Frame) + Query2Emo	Accuracy	84.1	—	Unverified
