M^2IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension
Xuyang Liu, Ting Liu, Siteng Huang, Yi Xin, Yue Hu, Quanjun Yin, Donglin Wang, Honggang Chen
Abstract
Referring expression comprehension (REC) is a vision-language task that locates a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but is increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with far fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient interaction between the pre-trained uni-modal vision and language encoders, and (2) high GPU memory usage caused by gradients propagating through the heavy pre-trained models. To this end, we present M^2IST: Multi-Modal Interactive Side-Tuning with M^3ISAs (Mixture of Multi-Modal Interactive Side-Adapters). During fine-tuning, we keep the pre-trained uni-modal encoders frozen and update only the M^3ISAs on lightweight side networks that progressively connect the two encoders, enabling more comprehensive vision-language alignment and efficient tuning for REC. Empirical results show that M^2IST achieves a better balance between performance and efficiency than full fine-tuning and other PETL methods. With M^2IST, standard transformer-based REC methods achieve performance competitive with, or even superior to, full fine-tuning while using only 2.11% of the tunable parameters, 39.61% of the GPU memory, and 63.46% of the fine-tuning time required for full fine-tuning.
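To make the side-tuning idea concrete, below is a minimal PyTorch sketch of the scheme the abstract describes: both pre-trained encoders are frozen, and only a small side-adapter that mixes vision and language features is trained. The class names (`MultiModalSideAdapter`, `SideTunedREC`), dimensions, and the cross-attention fusion are illustrative assumptions, not the authors' exact M^3ISA implementation.

```python
import torch
import torch.nn as nn


class MultiModalSideAdapter(nn.Module):
    """Lightweight trainable module that fuses features from frozen encoders.

    Hypothetical design: down-project both modalities, let vision tokens
    attend to language tokens, then up-project with a residual connection.
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down_v = nn.Linear(dim, bottleneck)
        self.down_l = nn.Linear(dim, bottleneck)
        self.cross_attn = nn.MultiheadAttention(bottleneck, num_heads=4, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        v = self.down_v(vis_feat)   # (B, Nv, bottleneck)
        l = self.down_l(lang_feat)  # (B, Nl, bottleneck)
        # Vision-language interaction: vision queries attend to language keys/values.
        fused, _ = self.cross_attn(query=v, key=l, value=l)
        return vis_feat + self.up(fused)  # residual back to encoder dimension


class SideTunedREC(nn.Module):
    """Frozen uni-modal encoders plus a trainable side-adapter and box head."""

    def __init__(self, vis_encoder: nn.Module, lang_encoder: nn.Module, dim: int):
        super().__init__()
        self.vis_encoder, self.lang_encoder = vis_encoder, lang_encoder
        # Freeze the heavy pre-trained encoders: only the side modules are tuned.
        for p in self.vis_encoder.parameters():
            p.requires_grad = False
        for p in self.lang_encoder.parameters():
            p.requires_grad = False
        self.side_adapter = MultiModalSideAdapter(dim)
        self.box_head = nn.Linear(dim, 4)  # predicts a (cx, cy, w, h) box

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # torch.no_grad() keeps encoder activations out of the backward graph,
        # which is what cuts GPU memory during fine-tuning.
        with torch.no_grad():
            vis_feat = self.vis_encoder(image_tokens)
            lang_feat = self.lang_encoder(text_tokens)
        fused = self.side_adapter(vis_feat, lang_feat)
        return self.box_head(fused.mean(dim=1))


# Toy usage with stand-in encoders (real ones would be e.g. a ViT and BERT):
dim = 256
model = SideTunedREC(nn.Linear(dim, dim), nn.Linear(dim, dim), dim)
boxes = model(torch.randn(2, 196, dim), torch.randn(2, 20, dim))  # (2, 4)
```

Because gradients never flow through the frozen encoders, the optimizer only touches the adapter and box head, which is the source of the parameter, memory, and time savings the abstract reports.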