SOTAVerified

Revisiting the Power of Prompt for Visual Tuning

2024-02-04Code Available0· sign in to hype

Yuzhu Wang, Lechao Cheng, Chaowei Fang, Dingwen Zhang, Manni Duan, Meng Wang

Code Available — Be the first to reproduce this paper.

Reproduce

Code

Abstract

Visual prompt tuning (VPT) is a promising solution incorporating learnable prompt tokens to customize pre-trained models for downstream tasks. However, VPT and its variants often encounter challenges like prompt initialization, prompt length, and subpar performance in self-supervised pretraining, hindering successful contextual adaptation. This study commences by exploring the correlation evolvement between prompts and patch tokens during proficient training. Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes. The strategic initialization, a stand-in for the previous initialization, substantially improves performance in fine-tuning. To refine further, we optimize token construction with a streamlined pipeline that maintains excellent performance with almost no increase in computational expenses compared to VPT. Exhaustive experiments show our proposed approach outperforms existing methods by a remarkable margin. For instance, it surpasses full fine-tuning in 19 out of 24 tasks, using less than 0.4% of learnable parameters on the FGVC and VTAB-1K benchmarks. Notably, our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%. Besides, the experimental results demonstrate the proposed SPT is robust to prompt lengths and scales well with model capacity and training data size. We finally provide an insightful exploration into the amount of target data facilitating the adaptation of pre-trained models to downstream tasks. The code is available at https://github.com/WangYZ1608/Self-Prompt-Tuning.

Tasks

Benchmark Results

DatasetModelMetricClaimedVerifiedStatus
FGVCSPT-Deep(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K)Mean Accuracy86Unverified
FGVCSPT-Shallow(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K)Mean Accuracy84.08Unverified
FGVCSPT-Deep(ViT-B/16_MAE_pretrained_ImageNet-1K)Mean Accuracy83.26Unverified
FGVCSPT-Shallow(ViT-B/16_MAE_pretrained_ImageNet-1K)Mean Accuracy73.95Unverified
VTAB-1k(Natural<7>)SPT-Deep(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K)Mean Accuracy76.2Unverified
VTAB-1k(Natural<7>)SPT-Shallow(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K)Mean Accuracy74.47Unverified
VTAB-1k(Natural<7>)SPT-Deep(ViT-B/16_MAE_pretrained_ImageNet-1K)Mean Accuracy67.19Unverified
VTAB-1k(Natural<7>)SPT-Shallow(ViT-B/16_MAE_pretrained_ImageNet-1K)Mean Accuracy62.53Unverified
VTAB-1k(Specialized<4>)SPT-Deep(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K)Mean Accuracy84.95Unverified
VTAB-1k(Specialized<4>)SPT-Shallow(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K)Mean Accuracy83.93Unverified
VTAB-1k(Specialized<4>)SPT-Deep(ViT-B/16_MAE_pretrained_ImageNet-1K)Mean Accuracy83.15Unverified
VTAB-1k(Specialized<4>)SPT-Shallow(ViT-B/16_MAE_pretrained_ImageNet-1K)Mean Accuracy80.9Unverified
VTAB-1k(Structured<8>)SPT-Shallow(ViT-B/16_MAE_pretrained_ImageNet-1K)Mean Accuracy53.46Unverified
VTAB-1k(Structured<8>)SPT-Deep(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K)Mean Accuracy58.36Unverified
VTAB-1k(Structured<8>)SPT-Shallow(ViT-B/16_MoCo_v3_pretrained_ImageNet-1K)Mean Accuracy55.16Unverified
VTAB-1k(Structured<8>)SPT-Deep(ViT-B/16_MAE_pretrained_ImageNet-1K)Mean Accuracy59.23Unverified

Reproductions