M3-Jepa: Multimodal Alignment via Multi-directional MoE based on the JEPA framework

2024-09-09

Hongyang Lei, Xiaolong Cheng, Dan Wang, Kun Fan, Qi Qin, Huazhen Huang, Yetao Wu, Qingqing Gu, Zhonglin Jiang, Yong Chen, Luo Ji

Code

github.com/HongyangLL/M3-JEPA
Official, PyTorch, ★ 23

Abstract

Current multimodal alignment strategies primarily use single or unified modality encoders and optimize the alignment in the original token space. Such a framework is easy to implement and integrates well with pretrained knowledge, but it may introduce information bias. To address this issue, the joint-embedding predictive architecture (JEPA) computes the alignment loss in the latent space, using a predictor to map the input encoding into the output latent space. However, the application of JEPA to multimodal scenarios has so far been limited. In this paper, we introduce M3-Jepa, a scalable multimodal alignment framework whose predictor is implemented as a multi-directional mixture of experts (MoE). Using information-theoretic derivations, we show that the framework can maximize mutual information by alternating the optimization between different uni-directional tasks. Through carefully designed experiments, we show that M3-Jepa achieves state-of-the-art performance across different modalities and tasks, generalizes to unseen datasets and domains, and is computationally efficient in both training and inference. Our study indicates that M3-Jepa might provide a new paradigm for self-supervised learning and open-world modeling.
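
The mechanism described in the abstract (a predictor that maps one modality's latent encoding into another's latent space, built as a mixture of experts and trained by alternating between uni-directional tasks) can be illustrated with a toy sketch. This is not the authors' implementation: the class `DirectionalMoEPredictor`, the helper `alternating_directions`, the direction labels, the linear experts, the gate that depends only on the task direction, and the squared-error latent loss are all assumptions made for illustration, and the loop only evaluates the loss without any gradient update.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class DirectionalMoEPredictor:
    """Hypothetical toy predictor: linear experts mixed by a per-direction gate."""

    def __init__(self, dim, n_experts, directions, seed=0):
        rng = random.Random(seed)
        # Each expert is a dim x dim linear map acting in latent space.
        self.experts = [
            [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(n_experts)
        ]
        # Assumption: the gate holds one logit vector per task direction.
        self.gate_logits = {
            d: [rng.gauss(0, 1) for _ in range(n_experts)] for d in directions
        }

    def gate(self, direction):
        """Expert mixing weights for the given direction (they sum to 1)."""
        return softmax(self.gate_logits[direction])

    def predict(self, z_in, direction):
        """Map an input latent to a predicted output latent."""
        weights = self.gate(direction)
        outputs = [matvec(W, z_in) for W in self.experts]
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(z_in))
        ]


def latent_loss(pred, target):
    """Mean squared error in latent space (a stand-in for the paper's loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def alternating_directions(directions, n_steps):
    """Round-robin schedule over uni-directional tasks."""
    return [directions[step % len(directions)] for step in range(n_steps)]


directions = ["image->text", "text->image"]
predictor = DirectionalMoEPredictor(dim=4, n_experts=3, directions=directions)

# Made-up latent encodings standing in for the outputs of modality encoders.
z_img = [0.5, -1.0, 0.25, 2.0]
z_txt = [0.1, 0.3, -0.2, 0.7]

for direction in alternating_directions(directions, 4):
    src, tgt = (z_img, z_txt) if direction == "image->text" else (z_txt, z_img)
    loss = latent_loss(predictor.predict(src, direction), tgt)
    print(direction, round(loss, 4))
```

The experts are shared across directions here and only the gate changes with the task. That is one plausible reading of "multi-directional", chosen to keep the sketch small. The official repository listed above is the reference for the actual design.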

Tasks

Computational Efficiency, Cross-Modal Retrieval, Mixture-of-Experts, Question Answering, Retrieval, Self-Supervised Learning, Visual Question Answering
