Multimodal Token Fusion for Vision Transformers
Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang
Code
- github.com/yikaiw/TokenFusion (official, in paper): PyTorch, ★ 185
- github.com/huawei-noah/noah-research/tree/master/TokenFusion (official): PyTorch, ★ 0
- github.com/mindspore-ai/models/tree/master/research/cv/TokenFusion (official): MindSpore, ★ 0
- github.com/harshm121/m3l: PyTorch, ★ 43
- github.com/lyqcom/models-master: MindSpore, ★ 2
- github.com/robin-ex/TokenFusion: MindSpore, ★ 1
- github.com/MindSpore-paper-code-2/code3/tree/main/TokenFusion: MindSpore, ★ 0
- github.com/2023-MindSpore-1/ms-code-217/tree/main/TokenFusion: MindSpore, ★ 0
- github.com/2024-MindSpore-1/Code2/tree/main/wangyikai/EIP-mindspore: MindSpore, ★ 0
- github.com/2023-MindSpore-1/ms-code-7/tree/main/TokenFusion: MindSpore, ★ 0
Abstract
Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the intra-modal attentive weights may also be diluted, potentially undermining the final results. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes them with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments on a variety of homogeneous and heterogeneous modalities demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images. Our code is available at https://github.com/yikaiw/TokenFusion.
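The core substitution step described in the abstract can be sketched compactly. The snippet below is a minimal NumPy illustration, not the authors' implementation: the per-token importance scores, the threshold, and the projection matrices (`proj_ab`, `proj_ba`) are hypothetical placeholders, whereas in the paper the scores are learned per layer and trained with a pruning objective.

```python
import numpy as np

def token_fusion(tokens_a, tokens_b, scores_a, scores_b,
                 proj_ab, proj_ba, threshold=0.02):
    """Sketch of TokenFusion's substitution step for two aligned modalities.

    tokens_a, tokens_b : (N, D) token sequences of modalities A and B
    scores_a, scores_b : (N,) importance scores per token (assumed given)
    proj_ab, proj_ba   : (D, D) linear projections between modality spaces
    Tokens whose score falls below `threshold` are deemed uninformative and
    replaced by the projected token from the other modality at the same
    position; informative tokens pass through unchanged.
    """
    fused_a = np.where(scores_a[:, None] < threshold, tokens_b @ proj_ba, tokens_a)
    fused_b = np.where(scores_b[:, None] < threshold, tokens_a @ proj_ab, tokens_b)
    return fused_a, fused_b
```

Because substitution is positional, the original single-modal transformer layers can consume the fused sequences unchanged; the residual positional alignment mentioned in the abstract would additionally re-inject each substituted token's original positional embedding, which this sketch omits.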
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| ScanNetV2 | TokenFusion | mAP@0.5 | 54.2 | — | Unverified |
| SUN-RGBD val | TokenFusion | mAP@0.25 | 64.9 | — | Unverified |