Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition
Dongyuan Li, Yusong Wang, Kotaro Funakoshi, Manabu Okumura
Code
- github.com/wykstc/MERC-main (official implementation, PyTorch, ★ 18)
Abstract
Multimodal emotion recognition aims to recognize the emotion of each utterance from multiple modalities, and has received increasing attention for its applications in human-machine interaction. Current graph-based methods fail to simultaneously depict global contextual features and local, diverse uni-modal features in a dialogue. Furthermore, as the number of graph layers increases, they easily suffer from over-smoothing. In this paper, we propose a method for joint modality fusion and graph contrastive learning for multimodal emotion recognition (Joyful), where multimodal fusion, contrastive learning, and emotion recognition are jointly optimized. Specifically, we first design a new multimodal fusion mechanism that provides deep interaction and fusion between global contextual and uni-modal-specific features. Then, we introduce a graph contrastive learning framework with inter-view and intra-view contrastive losses to learn more distinguishable representations for samples with different sentiments. Extensive experiments on three benchmark datasets indicate that Joyful achieves state-of-the-art (SOTA) performance compared with all baselines.
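The abstract describes two components: a fusion mechanism that combines global contextual features with uni-modal-specific ones, and a graph contrastive learning framework with inter-view and intra-view losses. The PyTorch sketch below is only a rough illustration of those ideas under common formulations (a gated fusion, an InfoNCE inter-view term, and a supervised-contrastive intra-view term); the names `GatedFusion`, `info_nce`, and `graph_contrastive_loss` are hypothetical and are not taken from the paper or the linked repository, whose exact architecture and losses differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Toy gated fusion of a global contextual vector with a uni-modal feature.
    Illustrative only; the paper's fusion mechanism is more elaborate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, global_ctx: torch.Tensor, unimodal: torch.Tensor) -> torch.Tensor:
        # The gate decides, per dimension, how much of each representation to keep.
        g = torch.sigmoid(self.gate(torch.cat([global_ctx, unimodal], dim=-1)))
        return g * global_ctx + (1 - g) * unimodal


def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE / NT-Xent loss: the i-th anchor should match the i-th positive."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                    # (N, N) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)


def graph_contrastive_loss(view1, view2, labels, temperature: float = 0.1) -> torch.Tensor:
    """Inter-view + intra-view contrastive objective over node embeddings.

    view1, view2: (N, d) embeddings of the same N utterances from two graph views.
    labels:       (N,) emotion labels used to choose intra-view positives.
    """
    # Inter-view term: the same utterance in the two views forms a positive pair.
    inter = 0.5 * (info_nce(view1, view2, temperature) + info_nce(view2, view1, temperature))

    # Intra-view term (supervised-contrastive flavour): utterances sharing an
    # emotion label are pulled together within a single view.
    z = F.normalize(view1, dim=-1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)              # exclude self-similarity
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    pos_mask.fill_diagonal_(0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    intra = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return inter + intra.mean()


if __name__ == "__main__":
    # Smoke test with random features for 8 utterances of dimension 16.
    v1, v2 = torch.randn(8, 16), torch.randn(8, 16)
    y = torch.randint(0, 6, (8,))
    fused = GatedFusion(16)(v1, v2)
    print(fused.shape, graph_contrastive_loss(v1, v2, y).item())
```

In this sketch the inter-view term treats the two augmented graph views of the same utterance as a positive pair, while the intra-view term uses emotion labels to form positives within a single view; see the linked repository for the authors' actual formulation.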
Tasks
- Multimodal Emotion Recognition
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| IEMOCAP | Joyful | Weighted F1 | 70.5 | — | Unverified |
| IEMOCAP-4 | Joyful | Weighted F1 | 85.7 | — | Unverified |
| MELD | Joyful | Weighted F1 | 61.77 | — | Unverified |