
MSAF: Multimodal Split Attention Fusion

2020-12-13 · Code Available

Lang Su, Chuqing Hu, Guofa Li, Dongpu Cao

Abstract

Multimodal learning mimics the reasoning process of the human multi-sensory system, which is used to perceive the surrounding world. While making a prediction, the human brain tends to relate crucial cues from multiple sources of information. In this work, we propose a novel multimodal fusion module that learns to emphasize more contributive features across all modalities. Specifically, the proposed Multimodal Split Attention Fusion (MSAF) module splits each modality into channel-wise equal feature blocks and creates a joint representation that is used to generate soft attention for each channel across the feature blocks. Further, the MSAF module is designed to be compatible with features of various spatial dimensions and sequence lengths, making it suitable for both CNNs and RNNs. Thus, MSAF can be easily added to fuse the features of any unimodal networks and to reuse existing pretrained unimodal model weights. To demonstrate the effectiveness of our fusion module, we design three multimodal networks with MSAF for emotion recognition, sentiment analysis, and action recognition tasks. Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.
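
The sketch below illustrates the fusion idea described in the abstract: pooled features from each modality are split into channel-wise equal blocks, the blocks are combined into a joint representation, and that representation produces channel-wise soft attention that recalibrates every block before the modalities are reassembled. This is a minimal reading of the abstract, not the authors' reference implementation: the class name `MSAFSketch`, the assumption that each modality arrives as a globally pooled `(batch, channels)` vector, the bottleneck sizes, and the softmax-across-blocks attention are all illustrative choices.

```python
import torch
import torch.nn as nn


class MSAFSketch(nn.Module):
    """Minimal split-attention fusion sketch (assumptions noted above)."""

    def __init__(self, in_channels, block_channels, reduction=4):
        super().__init__()
        # Assumes every modality's channel count is divisible by block_channels.
        assert all(c % block_channels == 0 for c in in_channels)
        self.block_channels = block_channels
        self.num_blocks = sum(c // block_channels for c in in_channels)
        hidden = max(block_channels // reduction, 4)
        # Shared bottleneck that forms the joint representation of all blocks.
        self.bottleneck = nn.Sequential(
            nn.Linear(block_channels, hidden),
            nn.ReLU(inplace=True),
        )
        # One attention head per feature block, producing channel-wise logits.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, block_channels) for _ in range(self.num_blocks)]
        )

    def forward(self, features):
        # features: list of (batch, C_m) pooled tensors, one per modality.
        blocks = []
        for x in features:
            blocks.extend(torch.split(x, self.block_channels, dim=1))
        # Joint representation: element-wise sum over all equal-sized blocks.
        joint = self.bottleneck(torch.stack(blocks, dim=0).sum(dim=0))
        # Channel-wise soft attention, normalised across the feature blocks.
        logits = torch.stack([head(joint) for head in self.heads], dim=0)
        attention = torch.softmax(logits, dim=0)
        fused = [a * b for a, b in zip(attention, blocks)]
        # Reassemble each modality's recalibrated feature vector.
        out, idx = [], 0
        for x in features:
            n = x.shape[1] // self.block_channels
            out.append(torch.cat(fused[idx:idx + n], dim=1))
            idx += n
        return out


if __name__ == "__main__":
    # Hypothetical two-modality example, e.g. RGB and pose streams.
    msaf = MSAFSketch(in_channels=[256, 128], block_channels=64)
    rgb_feat, pose_feat = torch.randn(8, 256), torch.randn(8, 128)
    fused_rgb, fused_pose = msaf([rgb_feat, pose_feat])
    print(fused_rgb.shape, fused_pose.shape)  # (8, 256) and (8, 128)
```

Because the module only rescales existing channels, each modality's output keeps its original shape, which is consistent with the abstract's claim that MSAF can be dropped between pretrained unimodal backbones.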

Tasks

Emotion Recognition · Sentiment Analysis · Action Recognition

Benchmark Results

Dataset   | Model            | Metric        | Claimed | Verified | Status
NTU RGB+D | MSAF (RGB+Pose)  | Accuracy (CS) | 92.24   | –        | Unverified

Reproductions

No reproductions have been submitted yet.