Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Abstract
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to downstream tasks, several model compression techniques for pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and that fine-tuning pre-trained compact models can be competitive with more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied to the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
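To make the distillation step described above concrete, the following is a minimal PyTorch sketch of standard knowledge distillation, where an already pre-trained compact student is trained to match the soft predictions of a large fine-tuned teacher. The function names, the `student`/`teacher` callables, and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def soft_label_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 1.0) -> torch.Tensor:
    """Cross-entropy between the teacher's softened output distribution
    and the student's, as in standard knowledge distillation."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean() * temperature ** 2


def distillation_step(student, teacher, batch, optimizer):
    """One update of the (pre-trained) compact student on unlabeled task data,
    using the frozen, fine-tuned teacher's predictions as soft targets."""
    with torch.no_grad():
        teacher_logits = teacher(batch)  # teacher parameters are not updated
    student_logits = student(batch)
    loss = soft_label_distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading of the abstract, the student is first pre-trained with the usual self-supervised objective, then updated with `distillation_step` on task data, and may finally be fine-tuned on labeled examples; the exact schedule and hyperparameters are specified in the paper itself.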