Reducing Transformer Depth on Demand with Structured Dropout

2019-09-25 · ICLR 2020 · Code Available

Angela Fan, Edouard Grave, Armand Joulin


Code

github.com/prajjwal1/adaptive_transformer (PyTorch, ★ 43)
github.com/thunlp-mt/promptgating4mctg (PyTorch, ★ 14)
github.com/c00k1ez/plain-transformers (PyTorch, ★ 9)

Abstract

Overparameterized transformer networks have obtained state-of-the-art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality compared to training from scratch or using distillation.
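The idea in the abstract, dropping whole layers during training and then selecting a shallower sub-network at inference, can be illustrated with a minimal toy sketch. This is not the authors' implementation: the names `forward`, `prune_layers`, `drop_prob`, and `keep_every` are hypothetical, scalar functions stand in for transformer blocks, a residual connection around each layer is assumed, and keeping every k-th layer is only one assumed way to pick a sub-network.

```python
import random


def forward(x, layers, drop_prob=0.0, training=False, rng=None):
    """Apply residual layers in order.

    While training, each layer is skipped as a whole with probability
    drop_prob (structured dropout over layers). Assumption: every layer
    sits inside a residual connection, so skipping it leaves x unchanged.
    """
    rng = rng or random.Random()
    for layer in layers:
        if training and rng.random() < drop_prob:
            continue  # drop the entire layer; the identity path carries x
        x = x + layer(x)
    return x


def prune_layers(layers, keep_every=2):
    """Select a shallower sub-network by keeping every keep_every-th layer.

    Assumption: this selection rule is for illustration only; other
    rules for choosing which layers to keep are possible.
    """
    return layers[::keep_every]


# Toy "layers": each one contributes a constant update of 1.0.
toy_layers = [lambda x: 1.0 for _ in range(6)]

full_output = forward(0.0, toy_layers)                  # all 6 layers applied
pruned_layers = prune_layers(toy_layers, keep_every=2)  # 3 layers remain
pruned_output = forward(0.0, pruned_layers)             # no finetuning step
train_output = forward(0.0, toy_layers, drop_prob=0.5,
                       training=True, rng=random.Random(0))
```

Because a dropped layer reduces to the identity, the network sees many effective depths during training, which is what plausibly makes a pruned stack usable at inference without a separate finetuning pass. In a real model the callables would be transformer blocks and `drop_prob` a tuned hyperparameter.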

Tasks

Language Modeling, Language Modelling, Machine Translation, Open-Domain Question Answering, Question Answering, Translation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ELI5	Transformer Multitask + LayerDrop	Rouge-L	23.4	—	Unverified
