Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation
Xinglong Wu, Anfeng Huang, HongWei Yang, Hui He, Yu Tai, Weizhe Zhang
Code: github.com/WuXinglong-HIT/CLIPER (official PyTorch implementation)
Abstract
Multi-modal recommendation substantially improves recommender systems by modeling auxiliary information from multi-modal content. Most existing multi-modal recommendation models exploit multimedia information propagation to enrich item representations, directly using modality-specific embedding vectors obtained independently from upstream pre-trained models. This is suboptimal, however: abundant task-specific semantics remain unexplored, and the cross-modal semantic gap hinders recommendation performance. Inspired by recent progress on the cross-modal alignment model CLIP, in this paper we propose a novel CLIP Enhanced Recommender (CLIPER) framework to bridge the semantic gap between modalities and extract fine-grained, multi-view semantic information. Specifically, we introduce a multi-view modality-alignment approach for representation extraction and measure the semantic similarity between modalities. Furthermore, we integrate the multi-view multimedia representations into downstream recommendation models. Extensive experiments on three public datasets demonstrate the consistent superiority of our model over state-of-the-art multi-modal recommendation models.
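To make the alignment idea concrete, below is a minimal, hypothetical sketch of CLIP-style cross-modal alignment for item features: modality-specific embeddings are projected into a shared space, L2-normalized, and scored with a temperature-scaled cosine similarity before being fused into a single item representation. The projection matrices `W_t`/`W_v`, the averaging fusion, and all dimensions are illustrative assumptions, not CLIPER's actual architecture.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length, as CLIP does before the dot product."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def clip_style_alignment(text_feats, image_feats, W_t, W_v, temperature=0.07):
    """Project each modality into a shared space and score cross-modal
    similarity with a temperature-scaled cosine product (CLIP-style)."""
    t = l2_normalize(text_feats @ W_t)   # (n_items, d_shared)
    v = l2_normalize(image_feats @ W_v)  # (n_items, d_shared)
    sim = (t @ v.T) / temperature        # (n_items, n_items) logits
    return t, v, sim

# Toy item features standing in for pre-trained text/image encoder outputs.
rng = np.random.default_rng(0)
n_items, d_text, d_image, d_shared = 4, 16, 32, 8
text_feats = rng.normal(size=(n_items, d_text))
image_feats = rng.normal(size=(n_items, d_image))
W_t = rng.normal(size=(d_text, d_shared)) * 0.1   # hypothetical projections
W_v = rng.normal(size=(d_image, d_shared)) * 0.1

t, v, sim = clip_style_alignment(text_feats, image_feats, W_t, W_v)
# Fuse the aligned views into one item embedding (simple average here).
item_emb = 0.5 * (t + v)
```

In a full pipeline, `sim` would feed a contrastive objective that pulls matched text-image pairs together, and `item_emb` would replace the raw modality-specific vectors handed to the downstream recommender.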