Visual Prompt Tuning
Visual Prompt Tuning (VPT) introduces only a small set of task-specific learnable parameters into the input space while keeping the entire pre-trained Transformer backbone frozen during downstream training. In practice, these additional parameters are simply prepended to the input sequence of each Transformer layer and learned together with a linear head during fine-tuning. VPT is especially effective in the low-data regime and maintains its advantage across data scales. It also remains competitive across a range of Transformer scales and designs (ViT-Base/Large/Huge, Swin). Taken together, these results suggest that VPT is one of the most effective ways of adapting ever-growing vision backbones.
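The prepending step above can be sketched in a few lines. This is a minimal NumPy illustration of the shallow variant (prompts inserted once, before the first layer); all dimensions, variable names, and the zero initialization are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ViT-B/16-style dimensions: 768-d embeddings, 196 patch tokens.
embed_dim, num_patches, num_prompts = 768, 196, 10

# Frozen backbone inputs: the [CLS] token plus patch embeddings (not trained).
cls_token = rng.standard_normal((1, embed_dim))
patch_tokens = rng.standard_normal((num_patches, embed_dim))

# The only new input-space parameters VPT adds: learnable prompt tokens,
# optimized by the downstream loss while the backbone stays frozen.
prompts = np.zeros((num_prompts, embed_dim))

# Prompts are prepended between the [CLS] token and the patch tokens.
layer_input = np.concatenate([cls_token, prompts, patch_tokens], axis=0)
print(layer_input.shape)  # (207, 768) = (1 + 10 + 196, 768)
```

In the deep variant, a fresh set of such prompt tokens is inserted at every Transformer layer rather than only the first, which accounts for the VPT-Deep/VPT-Shallow distinction in the results table below.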
Benchmark Results
| # | Method | Backbone / Pre-training | Metric | Claimed (%) | Verified | Status |
|---|---|---|---|---|---|---|
| 1 | SPT-Deep | ViT-B/16, MoCo v3, ImageNet-1K | Mean Accuracy | 76.2 | — | Unverified |
| 2 | GateVPT | ViT-B/16, MoCo v3, ImageNet-1K | Mean Accuracy | 74.84 | — | Unverified |
| 3 | SPT-Shallow | ViT-B/16, MoCo v3, ImageNet-1K | Mean Accuracy | 74.47 | — | Unverified |
| 4 | VPT-Deep | ViT-B/16, MoCo v3, ImageNet-1K | Mean Accuracy | 70.27 | — | Unverified |
| 5 | VPT-Shallow | ViT-B/16, MoCo v3, ImageNet-1K | Mean Accuracy | 67.34 | — | Unverified |
| 6 | SPT-Deep | ViT-B/16, MAE, ImageNet-1K | Mean Accuracy | 67.19 | — | Unverified |
| 7 | SPT-Shallow | ViT-B/16, MAE, ImageNet-1K | Mean Accuracy | 62.53 | — | Unverified |
| 8 | GateVPT | ViT-B/16, MAE, ImageNet-1K | Mean Accuracy | 47.61 | — | Unverified |
| 9 | VPT-Shallow | ViT-B/16, MAE, ImageNet-1K | Mean Accuracy | 39.96 | — | Unverified |
| 10 | VPT-Deep | ViT-B/16, MAE, ImageNet-1K | Mean Accuracy | 36.02 | — | Unverified |