Autoregressive Knowledge Distillation through Imitation Learning

2020-09-15 · EMNLP 2020

Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei

Code

github.com/asappresearch/imitkd
Official · In paper · PyTorch · ★ 2
github.com/hubreb/imitkd_ast
PyTorch · ★ 1

Abstract

The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of hindering inference speed, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times in comparison to the teacher model.
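The abstract does not spell out the algorithm, so the following is only a toy sketch of the general idea it points to: to reduce exposure bias, the student is scored on prefixes it generates itself, while the teacher supplies the next-token targets on those prefixes. Everything here is a hypothetical illustration and not the paper's implementation: the three-token vocabulary, the `teacher_probs` and `student_probs` models, the `rollout` and `distill_loss` helpers, and the mixing probability `beta` are all assumptions.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]  # hypothetical toy vocabulary


def teacher_probs(prefix):
    """Hypothetical toy teacher: alternates a/b and ends after three tokens."""
    if len(prefix) >= 3:
        return [0.05, 0.05, 0.9]
    if not prefix or prefix[-1] == "b":
        return [0.8, 0.1, 0.1]
    return [0.1, 0.8, 0.1]


def student_probs(prefix, logits):
    """Hypothetical toy student: a context-free softmax over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def rollout(logits, beta, reference, rng, max_len=5):
    """Return the reference with probability beta, else a student-sampled sequence."""
    if rng.random() < beta:
        return list(reference)
    seq = []
    while len(seq) < max_len:
        tok = rng.choices(VOCAB, weights=student_probs(seq, logits))[0]
        seq.append(tok)
        if tok == "<eos>":
            break
    return seq


def distill_loss(seq, logits):
    """Mean cross-entropy of student vs. teacher next-token distributions
    over every prefix of seq (the states the rollout actually visited)."""
    total = 0.0
    for t in range(len(seq)):
        prefix = seq[:t]
        q = teacher_probs(prefix)
        p = student_probs(prefix, logits)
        total -= sum(qi * math.log(pi) for qi, pi in zip(q, p))
    return total / len(seq)


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # untrained (uniform) student
seq = rollout(logits, beta=0.5, reference=["a", "b", "a", "<eos>"], rng=rng)
loss = distill_loss(seq, logits)
print(seq, round(loss, 4))
```

In a real setup the student would be a neural sequence model and the loss would be minimized by gradient descent; the point of the sketch is only that the teacher's targets are queried on prefixes the student may have produced, rather than solely on ground-truth prefixes.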

Tasks

Imitation Learning · Knowledge Distillation · Machine Translation · Text Generation · Translation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
IWSLT2014 German-English	ImitKD + Full	BLEU score	35.4	—	Unverified
