Visual Prompt Tuning
Visual Prompt Tuning (VPT) introduces only a small number of task-specific learnable parameters into the input space while keeping the entire pre-trained Transformer backbone frozen during downstream training. In practice, these additional parameters are simply prepended to the input sequence of each Transformer layer and learned together with a linear head during fine-tuning. VPT is especially effective in the low-data regime and maintains its advantage across data scales. It is also competitive across a range of Transformer scales and designs (ViT-Base/Large/Huge, Swin). Taken together, these results suggest that VPT is one of the most effective ways of adapting ever-growing vision backbones.
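The prompt-token bookkeeping behind the shallow and deep variants can be sketched in plain Python. This is a schematic illustration, not the paper's implementation: the lists below stand in for embedding vectors, `frozen_layer` is a hypothetical stand-in for a frozen Transformer block, and the 12-layer / 5-prompt sizes are illustrative assumptions.

```python
# Schematic sketch of VPT-Shallow vs. VPT-Deep prompt handling.
# Tokens are plain Python lists standing in for embedding vectors;
# frozen_layer is a stand-in for a frozen Transformer block.

NUM_LAYERS = 12   # illustrative depth (e.g. ViT-B/16)
NUM_PROMPTS = 5   # illustrative prompt length

def frozen_layer(tokens):
    # A frozen Transformer block: maps a token sequence to a
    # same-length token sequence (identity here, for illustration).
    return list(tokens)

def vpt_shallow(patch_tokens, prompts):
    # Shallow variant: learnable prompts are prepended once, before
    # the first layer, and then flow through every frozen layer.
    tokens = prompts + patch_tokens
    for _ in range(NUM_LAYERS):
        tokens = frozen_layer(tokens)
    return tokens

def vpt_deep(patch_tokens, per_layer_prompts):
    # Deep variant: every layer gets its own learnable prompts; the
    # previous layer's prompt outputs are discarded and replaced.
    tokens = per_layer_prompts[0] + patch_tokens
    for i in range(1, NUM_LAYERS):
        tokens = frozen_layer(tokens)
        tokens = per_layer_prompts[i] + tokens[len(per_layer_prompts[i - 1]):]
    return frozen_layer(tokens)
```

In either variant only the prompts (and a linear head, not shown) receive gradients; the sequence length seen by every frozen layer is simply `NUM_PROMPTS` plus the number of patch tokens.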
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SPT-Deep (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 86 | — | Unverified |
| 2 | SPT-Shallow (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 84.08 | — | Unverified |
| 3 | SPT-Deep (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 83.26 | — | Unverified |
| 4 | VPT-Deep (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 83.12 | — | Unverified |
| 5 | GateVPT (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 83 | — | Unverified |
| 6 | VPT-Shallow (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 79.26 | — | Unverified |
| 7 | SPT-Shallow (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 73.95 | — | Unverified |
| 8 | GateVPT (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 73.39 | — | Unverified |
| 9 | VPT-Deep (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 72.02 | — | Unverified |
| 10 | VPT-Shallow (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 57.84 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SPT-Deep (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 76.2 | — | Unverified |
| 2 | GateVPT (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 74.84 | — | Unverified |
| 3 | SPT-Shallow (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 74.47 | — | Unverified |
| 4 | VPT-Deep (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 70.27 | — | Unverified |
| 5 | VPT-Shallow (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 67.34 | — | Unverified |
| 6 | SPT-Deep (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 67.19 | — | Unverified |
| 7 | SPT-Shallow (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 62.53 | — | Unverified |
| 8 | GateVPT (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 47.61 | — | Unverified |
| 9 | VPT-Shallow (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 39.96 | — | Unverified |
| 10 | VPT-Deep (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 36.02 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SPT-Deep (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 84.95 | — | Unverified |
| 2 | SPT-Shallow (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 83.93 | — | Unverified |
| 3 | GateVPT (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 83.38 | — | Unverified |
| 4 | SPT-Deep (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 83.15 | — | Unverified |
| 5 | VPT-Deep (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 83.04 | — | Unverified |
| 6 | VPT-Shallow (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 82.26 | — | Unverified |
| 7 | SPT-Shallow (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 80.9 | — | Unverified |
| 8 | GateVPT (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 76.86 | — | Unverified |
| 9 | VPT-Shallow (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 69.65 | — | Unverified |
| 10 | VPT-Deep (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 60.61 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SPT-Deep (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 59.23 | — | Unverified |
| 2 | SPT-Deep (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 58.36 | — | Unverified |
| 3 | SPT-Shallow (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 55.16 | — | Unverified |
| 4 | SPT-Shallow (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 53.46 | — | Unverified |
| 5 | GateVPT (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 49.1 | — | Unverified |
| 6 | VPT-Deep (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 42.38 | — | Unverified |
| 7 | VPT-Shallow (ViT-B/16, MoCo v3, ImageNet-1K) | Mean Accuracy (%) | 37.55 | — | Unverified |
| 8 | GateVPT (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 36.8 | — | Unverified |
| 9 | VPT-Shallow (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 27.5 | — | Unverified |
| 10 | VPT-Deep (ViT-B/16, MAE, ImageNet-1K) | Mean Accuracy (%) | 26.57 | — | Unverified |