Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

2024-09-12

Mengying Ge, Mingyang Li, Dongkai Tang, Pengbo Li, Kuo Liu, Shuhao Deng, Songbai Pu, Long Liu, Yang Song, Tao Zhang

Abstract

In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate modal competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly; the resulting joint audio-text feature is then late-fused with the other unimodal features. To address data insufficiency and class imbalance, we mine additional data through multiple rounds of multi-model voting. Moreover, to enhance the quality of audio features, we preprocess the audio with speech source separation. Our model ranks 2nd in both the MER2024-SEMI and MER2024-NOISE tracks, validating the effectiveness of our method.
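As a rough illustration of the fusion scheme described above, here is a minimal PyTorch sketch: audio and text features are projected into a shared space and jointly encoded first (early fusion), and the pooled joint feature is then concatenated with other unimodal features before classification (late fusion). The module names, feature dimensions, the small transformer standing in for the LLM backbone, and the six-class output are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EarlyAudioTextFusion(nn.Module):
    """Jointly encodes audio and text first (early fusion). A small
    transformer stands in for the LLM-based joint training; all
    dimensions below are assumed for illustration."""
    def __init__(self, audio_dim=1024, text_dim=768, hidden_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.joint_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, Ta, audio_dim); text_feats: (B, Tt, text_dim)
        joint = torch.cat([self.audio_proj(audio_feats),
                           self.text_proj(text_feats)], dim=1)
        joint = self.joint_encoder(joint)
        return joint.mean(dim=1)  # pooled joint audio-text feature

class LateFusionClassifier(nn.Module):
    """Late-fuses the joint audio-text feature with another unimodal
    feature (e.g. visual) by concatenation before the emotion head."""
    def __init__(self, joint_dim=512, visual_dim=512, num_classes=6):
        super().__init__()
        self.head = nn.Linear(joint_dim + visual_dim, num_classes)

    def forward(self, joint_feat, visual_feat):
        return self.head(torch.cat([joint_feat, visual_feat], dim=-1))
```

The point of the early stage is that audio and text co-adapt through a shared encoder from the start, rather than being trained separately and left to compete at the final fusion layer, which is the motivation the abstract gives for this design.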
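The data-mining step can likewise be sketched as agreement-based pseudo-labeling: several trained models predict on the unlabeled pool, and only samples where enough models agree are kept as pseudo-labels for the next training round. The function below is a hypothetical minimal version; the agreement threshold, number of models, and number of rounds are assumptions, not details from the paper.

```python
from collections import Counter

def vote_pseudo_labels(model_preds, min_agree):
    """model_preds: list of per-model prediction lists of equal length.
    Returns {sample_index: label} for samples where at least
    `min_agree` models predict the same class (hypothetical sketch)."""
    pseudo = {}
    for i, preds in enumerate(zip(*model_preds)):
        label, count = Counter(preds).most_common(1)[0]
        if count >= min_agree:
            pseudo[i] = label
    return pseudo

# One mining round: pseudo-label confident samples, add them to the
# training set, retrain, and repeat with the updated models.
preds_a = [0, 1, 2, 1]  # toy predictions from three models
preds_b = [0, 1, 0, 1]
preds_c = [0, 2, 2, 1]
print(vote_pseudo_labels([preds_a, preds_b, preds_c], min_agree=2))
# {0: 0, 1: 1, 2: 2, 3: 1}
```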
