Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

2024-09-12

Mengying Ge, Mingyang Li, Dongkai Tang, Pengbo Li, Kuo Liu, Shuhao Deng, Songbai Pu, Long Liu, Yang Song, Tao Zhang

Abstract

In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate modal competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly; the resulting joint audio-text feature is then late-fused with the other unimodal features. To address data insufficiency and class imbalance, we mine additional data through multiple rounds of multi-model voting. Moreover, to enhance the quality of audio features, we preprocess the audio with speech source separation. Our model ranks 2nd in both the MER2024-SEMI and MER2024-NOISE tracks, validating the effectiveness of our method.
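As a rough illustration of the fusion scheme described above, here is a minimal PyTorch sketch: audio and text features are projected into a shared space and jointly encoded first (early fusion), and the pooled joint feature is then concatenated with other unimodal features before classification (late fusion). The module names, feature dimensions, the small transformer standing in for the LLM backbone, and the six-class output are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EarlyAudioTextFusion(nn.Module):
    """Jointly encodes audio and text first (early fusion). A small
    transformer stands in for the LLM-based joint training; all
    dimensions below are assumed for illustration."""
    def __init__(self, audio_dim=1024, text_dim=768, hidden_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.joint_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, Ta, audio_dim); text_feats: (B, Tt, text_dim)
        joint = torch.cat([self.audio_proj(audio_feats),
                           self.text_proj(text_feats)], dim=1)
        joint = self.joint_encoder(joint)
        return joint.mean(dim=1)  # pooled joint audio-text feature

class LateFusionClassifier(nn.Module):
    """Late-fuses the joint audio-text feature with another unimodal
    feature (e.g. visual) by concatenation before the emotion head."""
    def __init__(self, joint_dim=512, visual_dim=512, num_classes=6):
        super().__init__()
        self.head = nn.Linear(joint_dim + visual_dim, num_classes)

    def forward(self, joint_feat, visual_feat):
        return self.head(torch.cat([joint_feat, visual_feat], dim=-1))
```

The point of the early stage is that audio and text co-adapt through a shared encoder from the start, rather than being trained separately and left to compete at the final fusion layer, which is the motivation the abstract gives for this design.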
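The data-mining step can likewise be sketched as agreement-based pseudo-labeling: several trained models predict on the unlabeled pool, and only samples where enough models agree are kept as pseudo-labels for the next training round. The function below is a hypothetical minimal version; the agreement threshold, number of models, and number of rounds are assumptions, not details from the paper.

```python
from collections import Counter

def vote_pseudo_labels(model_preds, min_agree):
    """model_preds: list of per-model prediction lists of equal length.
    Returns {sample_index: label} for samples where at least
    `min_agree` models predict the same class (hypothetical sketch)."""
    pseudo = {}
    for i, preds in enumerate(zip(*model_preds)):
        label, count = Counter(preds).most_common(1)[0]
        if count >= min_agree:
            pseudo[i] = label
    return pseudo

# One mining round: pseudo-label confident samples, add them to the
# training set, retrain, and repeat with the updated models.
preds_a = [0, 1, 2, 1]  # toy predictions from three models
preds_b = [0, 1, 0, 1]
preds_c = [0, 2, 2, 1]
print(vote_pseudo_labels([preds_a, preds_b, preds_c], min_agree=2))
# {0: 0, 1: 1, 2: 2, 3: 1}
```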
