Let Go of Your Labels with Unsupervised Transfer
Artyom Gadetsky, Yulun Jiang, Maria Brbic
Code: github.com/mlbio-epfl/turtle (official PyTorch implementation, ★ 80)
Abstract
Foundation vision-language models have enabled remarkable zero-shot transferability of pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still requires human guidance to define the visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal-margin classifiers in the representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision or task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, despite being fully unsupervised, TURTLE outperforms zero-shot transfer baselines on a wide range of datasets. In particular, TURTLE matches the average performance of CLIP zero-shot across the 26 datasets when employing the same representation space, spanning a wide range of architectures and model sizes. By guiding the search for the underlying labeling with the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt-tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.
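The guiding principle stated in the abstract — prefer the labeling that induces maximal-margin classifiers in several frozen representation spaces — can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy features, the logistic-regression probe used as a margin proxy, and all names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two frozen "foundation model" representation spaces for the same 40 samples.
# The underlying (unknown) labeling separates the samples in BOTH spaces.
n = 40
labels_true = np.array([0] * 20 + [1] * 20)
Z1 = rng.normal(size=(n, 5)) + labels_true[:, None] * 3.0  # space of model 1
Z2 = rng.normal(size=(n, 5)) - labels_true[:, None] * 3.0  # space of model 2

def probe_loss(Z, y, steps=200, lr=0.1):
    """Fit a logistic-regression probe on frozen features Z for labeling y
    by gradient descent and return its final training loss. A lower loss
    here is a proxy for a larger-margin linear separation under y."""
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        g = p - y                      # gradient of the logistic loss
        w -= lr * Z.T @ g / len(y)
        b -= lr * g.mean()
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def score(y):
    # TURTLE-style criterion (sketch): a labeling is scored by how well
    # linear classifiers it induces fit in BOTH representation spaces.
    return probe_loss(Z1, y) + probe_loss(Z2, y)

random_labels = rng.integers(0, 2, size=n)
print(score(labels_true), score(random_labels))
```

The underlying labeling yields linearly separable classes in both spaces, so its probes reach a much lower loss than probes trained on a random labeling; the full method searches over labelings to minimize exactly this kind of criterion rather than comparing two fixed candidates.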
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Birdsnap | TURTLE (CLIP + DINOv2) | Accuracy | 68.1 | — | Unverified |
| Caltech-101 | TURTLE (CLIP + DINOv2) | Accuracy | 89.8 | — | Unverified |
| CIFAR-10 | TURTLE (CLIP + DINOv2) | Accuracy | 100.0 | — | Unverified |
| CIFAR-100 | TURTLE (CLIP + DINOv2) | Accuracy | 90.0 | — | Unverified |
| CLEVR Counts | TURTLE (CLIP + DINOv2) | Accuracy | 24.0 | — | Unverified |
| Country211 | TURTLE (CLIP + DINOv2) | Accuracy | 11.1 | — | Unverified |
| DTD | TURTLE (CLIP + DINOv2) | Accuracy | 57.3 | — | Unverified |
| EuroSAT | TURTLE (CLIP + DINOv2) | Accuracy | 96.6 | — | Unverified |
| FER2013 | TURTLE (CLIP + DINOv2) | Accuracy | 36.2 | — | Unverified |
| FGVC-Aircraft | TURTLE (CLIP + DINOv2) | Accuracy | 36.5 | — | Unverified |
| Flowers-102 | TURTLE (CLIP + DINOv2) | Accuracy | 99.6 | — | Unverified |
| Food-101 | TURTLE (CLIP + DINOv2) | Accuracy | 92.2 | — | Unverified |
| GTSRB | TURTLE (CLIP + DINOv2) | Accuracy | 48.4 | — | Unverified |
| Hateful Memes | TURTLE (CLIP + DINOv2) | Accuracy | 54.2 | — | Unverified |
| ImageNet | TURTLE (CLIP + DINOv2) | Accuracy | 72.9 | — | Unverified |
| Kinetics-700 | TURTLE (CLIP + DINOv2) | Accuracy | 43.0 | — | Unverified |
| KITTI | TURTLE (CLIP + DINOv2) | Accuracy | 39.4 | — | Unverified |
| MNIST | TURTLE (CLIP + DINOv2) | Accuracy | 97.8 | — | Unverified |
| Oxford-IIIT Pets | TURTLE (CLIP + DINOv2) | Accuracy | 92.3 | — | Unverified |
| PCam | TURTLE (CLIP + DINOv2) | Accuracy | 52.0 | — | Unverified |
| Rendered SST2 | TURTLE (CLIP + DINOv2) | Accuracy | 51.6 | — | Unverified |
| RESISC45 | TURTLE (CLIP + DINOv2) | Accuracy | 89.6 | — | Unverified |
| Stanford Cars | TURTLE (CLIP + DINOv2) | Accuracy | 65.0 | — | Unverified |
| STL-10 | TURTLE (CLIP + DINOv2) | Accuracy | 100.0 | — | Unverified |
| SUN397 | TURTLE (CLIP + DINOv2) | Accuracy | 67.9 | — | Unverified |
| UCF101 | TURTLE (CLIP + DINOv2) | Accuracy | 82.3 | — | Unverified |