Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
Code
- github.com/facebookresearch/fairseq/tree/main/examples/data2vec (Official, PyTorch, ★ 0)
- github.com/gaasher/data2vec2.0_vision (PyTorch, ★ 18)
- github.com/ashutosh1919/data2vec-pytorch (PyTorch, ★ 16)
- github.com/MindSpore-scientific/code-3/tree/main/contextual-learning (MindSpore, ★ 0)
- gitlab.com/birder/birder (PyTorch, ★ 0)
Abstract
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder, and amortize the effort of building teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x lower pre-training time; on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy yields an ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
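The two efficiency ideas the abstract names — skipping masked tokens in the student encoder, and amortizing teacher targets across several masked versions of the same sample — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the "teacher" here is a placeholder projection (in the paper it is an EMA copy of the student producing contextualized representations), and the "encoder" is a stand-in that only demonstrates the reduced input length; neither reflects the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_targets(tokens):
    # Placeholder for the teacher network (assumption: a fixed linear map).
    # In data2vec 2.0 this would be an EMA-updated copy of the student.
    W = np.eye(tokens.shape[-1])
    return tokens @ W

def student_encode(tokens, mask):
    # data2vec 2.0 does not encode masked tokens: the student encoder
    # only processes the visible subset, shrinking its input length.
    return tokens[~mask]

# One sample of 16 tokens with 8-dim features.
tokens = rng.standard_normal((16, 8))

# Amortization: build the (expensive) teacher targets ONCE per sample...
targets = teacher_targets(tokens)

# ...then reuse them for M different masked versions of that sample.
M = 4
for _ in range(M):
    mask = rng.random(16) < 0.6            # mask ~60% of tokens
    encoded = student_encode(tokens, mask)  # encoder sees visible tokens only
    assert encoded.shape[0] == (~mask).sum()
    # A decoder would then predict `targets` at the masked positions.
```

The teacher forward pass is thus paid once per sample instead of once per masked view, which is where much of the claimed speed-up comes from.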
Benchmark Results
| Dataset | Model | Metric | Claimed (%) | Verified | Status |
|---|---|---|---|---|---|
| ImageNet | data2vec 2.0 | Top 1 Accuracy | 87.4 | — | Unverified |