Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
Code
- github.com/facebookresearch/fairseq/tree/main/examples/data2vec (Official, PyTorch, ★ 0)
- github.com/gaasher/data2vec2.0_vision (PyTorch, ★ 18)
- github.com/ashutosh1919/data2vec-pytorch (PyTorch, ★ 16)
- github.com/MindSpore-scientific/code-3/tree/main/contextual-learning (MindSpore, ★ 0)
- gitlab.com/birder/birder (PyTorch, ★ 0)
Abstract
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder, and amortize the effort of building teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x lower pre-training time; on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy yields an ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
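The two efficiency ideas the abstract names — skipping masked tokens in the student encoder, and amortizing teacher targets across several masked versions of the same sample — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the "teacher" here is a placeholder projection (in the paper it is an EMA copy of the student producing contextualized representations), and the "encoder" is a stand-in that only demonstrates the reduced input length; neither reflects the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_targets(tokens):
    # Placeholder for the teacher network (assumption: a fixed linear map).
    # In data2vec 2.0 this would be an EMA-updated copy of the student.
    W = np.eye(tokens.shape[-1])
    return tokens @ W

def student_encode(tokens, mask):
    # data2vec 2.0 does not encode masked tokens: the student encoder
    # only processes the visible subset, shrinking its input length.
    return tokens[~mask]

# One sample of 16 tokens with 8-dim features.
tokens = rng.standard_normal((16, 8))

# Amortization: build the (expensive) teacher targets ONCE per sample...
targets = teacher_targets(tokens)

# ...then reuse them for M different masked versions of that sample.
M = 4
for _ in range(M):
    mask = rng.random(16) < 0.6            # mask ~60% of tokens
    encoded = student_encode(tokens, mask)  # encoder sees visible tokens only
    assert encoded.shape[0] == (~mask).sum()
    # A decoder would then predict `targets` at the masked positions.
```

The teacher forward pass is thus paid once per sample instead of once per masked view, which is where much of the claimed speed-up comes from.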
Benchmark Results
| Dataset | Model | Metric | Claimed (%) | Verified | Status |
|---|---|---|---|---|---|
| ImageNet | data2vec 2.0 | Top 1 Accuracy | 87.4 | — | Unverified |