
MMRL: Multi-Modal Representation Learning for Vision-Language Models

2025-03-11 · CVPR 2025 · Code Available

Yuncheng Guo, Xiaodong Gu


Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders, where dataset-specific features are more prominent, while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with a trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed, wherein both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL.
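The shared representation space and the decoupled inference strategy described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all dimensions, the projection matrices `W_text`/`W_img`, and the mixing weight `alpha` are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only.
n_tokens, d_shared, d_text, d_img = 4, 8, 16, 16

# Shared, learnable, modality-agnostic representation tokens.
R = rng.normal(size=(n_tokens, d_shared))

# Trainable projections map the shared tokens into each encoder's
# token space; MMRL inserts these at the higher encoder layers.
W_text = rng.normal(size=(d_shared, d_text)) * 0.1
W_img = rng.normal(size=(d_shared, d_img)) * 0.1

text_tokens = R @ W_text    # representation tokens for the text encoder
image_tokens = R @ W_img    # representation tokens for the image encoder


def decoupled_logits(img_feat, class_feats, repr_feats=None, alpha=0.5):
    """Decoupled inference: for base classes, combine class and
    representation features; for new classes, pass repr_feats=None so
    only the more generalized class features are used. The mixing
    weight alpha is a hypothetical parameter, not from the paper."""
    logits = img_feat @ class_feats.T
    if repr_feats is not None:
        logits = alpha * logits + (1 - alpha) * (img_feat @ repr_feats.T)
    return logits
```

Only the projection layers and representation tokens are trained here; in MMRL the class token projection stays frozen to preserve the pre-trained knowledge.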

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| Caltech-101 | MMRL | Harmonic mean | 96.68 | — | Unverified |
| DTD | MMRL | Harmonic mean | 73.82 | — | Unverified |
| EuroSAT | MMRL | Harmonic mean | 87.21 | — | Unverified |
| FGVC-Aircraft | MMRL | Harmonic mean | 41.15 | — | Unverified |
| Food-101 | MMRL | Harmonic mean | 91.03 | — | Unverified |
| ImageNet | MMRL | Harmonic mean | 74.45 | — | Unverified |
| ImageNet-A | MMRL | Top-1 accuracy (%) | 51.2 | — | Unverified |
| ImageNet-R | MMRL | Top-1 accuracy (%) | 77.53 | — | Unverified |
| ImageNet-S | MMRL | Top-1 accuracy (%) | 49.17 | — | Unverified |
| ImageNet V2 | MMRL | Top-1 accuracy (%) | 64.47 | — | Unverified |
| Oxford 102 Flower | MMRL | Harmonic mean | 86.78 | — | Unverified |
| Oxford-IIIT Pet Dataset | MMRL | Harmonic mean | 96.74 | — | Unverified |
| Stanford Cars | MMRL | Harmonic mean | 78.06 | — | Unverified |
| SUN397 | MMRL | Harmonic mean | 81.2 | — | Unverified |
| UCF101 | MMRL | Harmonic mean | 83.89 | — | Unverified |
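The "Harmonic mean" metric above is the standard base-to-new harmonic mean used in few-shot VLM adaptation benchmarks: HM = 2 · Base · New / (Base + New). A quick check with hypothetical base/new accuracies (not values from the paper):

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of base-class and new-class accuracy (in %)."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)

# Hypothetical accuracies for illustration.
print(round(harmonic_mean(95.0, 90.0), 2))  # → 92.43
```

The harmonic mean penalizes imbalance, so a model cannot score well by excelling on base classes while collapsing on new ones.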

Reproductions