SOTAVerified

MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

2025-05-15 · Code Available

Yuncheng Guo, Xiaodong Gu


Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have significantly advanced transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, undermining their ability to generalize to new tasks. To address this, we propose Multi-Modal Representation Learning (MMRL), which introduces a shared, learnable, modality-agnostic representation space. MMRL generates space tokens projected into both text and image encoders as representation tokens, enabling more effective cross-modal interactions. Unlike prior methods that mainly optimize class token features, MMRL inserts representation tokens into higher encoder layers--where task-specific features are more prominent--while preserving general knowledge in the lower layers. During training, both class and representation features are jointly optimized: a trainable projection layer is applied to the representation tokens for task adaptation, while the class token's projection layer remains frozen to retain pre-trained knowledge. To further promote generalization, we introduce a regularization term aligning class and text features with the frozen VLM's zero-shot features. At inference, a decoupling strategy uses both class and representation features for base tasks, but only class features for novel tasks due to their stronger generalization. Building upon this, we propose MMRL++, a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters and enhances intra-modal interactions--particularly across the layers of representation tokens--allowing gradient sharing and instance-specific information to propagate more effectively through the network. Extensive experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods, achieving a strong balance between task-specific adaptation and generalization.
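The core ideas in the abstract -- a shared, modality-agnostic representation space projected into each encoder, and decoupled inference that drops representation features for novel classes -- can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: all dimensions, the fusion weighting, and every name here (`fuse`, `W_img`, `W_txt`, etc.) are assumptions for exposition only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration
n_tokens, d_shared = 5, 32   # learnable tokens in the shared representation space
d_img, d_txt = 64, 48        # per-modality encoder widths

# Shared, modality-agnostic representation tokens (would be learnable)
R = rng.standard_normal((n_tokens, d_shared))

# Projections mapping the shared space into each encoder's token space;
# these produce the "representation tokens" inserted into higher layers
W_img = rng.standard_normal((d_shared, d_img)) / np.sqrt(d_shared)
W_txt = rng.standard_normal((d_shared, d_txt)) / np.sqrt(d_shared)

img_rep_tokens = R @ W_img   # fed to higher image-encoder layers
txt_rep_tokens = R @ W_txt   # fed to higher text-encoder layers

def fuse(class_feat, rep_feat, novel):
    """Decoupled inference: base tasks fuse class + representation
    features; novel tasks keep only the better-generalizing class
    feature. The 50/50 weighting is an illustrative assumption."""
    if novel:
        return class_feat
    return 0.5 * class_feat + 0.5 * rep_feat

class_feat = rng.standard_normal(d_img)       # stand-in class-token feature
rep_feat = img_rep_tokens.mean(axis=0)        # stand-in pooled rep feature
base_feat = fuse(class_feat, rep_feat, novel=False)
novel_feat = fuse(class_feat, rep_feat, novel=True)
```

Note how the novel-task path bypasses the adapted representation features entirely, which is what lets the frozen class-token pathway preserve the pre-trained VLM's zero-shot generalization.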

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| Caltech-101 | MMRL++ | Harmonic mean | 96.75 | | Unverified |
| DTD | MMRL++ | Harmonic mean | 74.46 | | Unverified |
| EuroSAT | MMRL++ | Harmonic mean | 91.94 | | Unverified |
| FGVC-Aircraft | MMRL++ | Harmonic mean | 42.24 | | Unverified |
| Food-101 | MMRL++ | Harmonic mean | 91.1 | | Unverified |
| ImageNet | MMRL++ | Harmonic mean | 74.44 | | Unverified |
| Oxford 102 Flower | MMRL++ | Harmonic mean | 87.01 | | Unverified |
| Oxford-IIIT Pet Dataset | MMRL++ | Harmonic mean | 96.51 | | Unverified |
| Stanford Cars | MMRL++ | Harmonic mean | 78.18 | | Unverified |
| SUN397 | MMRL++ | Harmonic mean | 81.28 | | Unverified |
| UCF101 | MMRL++ | Harmonic mean | 83.81 | | Unverified |
