End-to-end transfer learning for speaker-independent cross-language and cross-corpus speech emotion recognition
Duowei Tang, Peter Kuppens, Luc Geurts, Toon van Waterschoot
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
Data-driven models achieve successful results in Speech Emotion Recognition (SER). However, these models, which are often based on general acoustic features or end-to-end approaches, show poor performance when the testing set has a different language than the training set or when these sets are taken from different datasets. To alleviate these problems, this paper presents an end-to-end Deep Neural Network (DNN) model based on transfer learning for cross-language and cross-corpus SER. We use the wav2vec 2.0 pre-trained model to transform audio time-domain waveforms from different languages, different speakers and different recording conditions into a feature space shared by multiple languages, thereby reducing the language variabilities in the speech embeddings. Next, we propose a new Deep-Within-Class Covariance Normalisation (Deep-WCCN) layer that can be inserted into the DNN model and aims to reduce other variabilities including speaker variability, channel variability and so on. The entire model is fine-tuned in an end-to-end manner on a combined loss and is validated on datasets from three languages (i.e. English, German, Chinese). Experimental results show that our proposed method outperforms the baseline model that is based on common acoustic feature sets for SER in the within-language setting and the cross-language setting. In addition, we also experimentally validate the effectiveness of Deep-WCCN, which can further improve the model performance. Next, we show that the proposed transfer learning method has good data efficiency when merging target language data into the fine-tuning process. The model speaker-independent SER performance increases with up to 15.6% when only 160s of target language data is used. Finally, our proposed model shows significantly better performance than other state-of-the-art models in cross-language SER.