Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT
Ashutosh Adhikari, Achyudh Ram, Raphael Tang, William L. Hamilton, Jimmy Lin
Abstract
Fine-tuned variants of BERT are able to achieve state-of-the-art accuracy on many natural language processing tasks, although at significant computational cost. In this paper, we verify BERT's effectiveness for document classification and investigate the extent to which BERT-level effectiveness can be obtained by different baselines combined with knowledge distillation, a popular model compression method. The results show that BERT-level effectiveness can be achieved by a single-layer LSTM with at least 40× fewer FLOPs and only ~3% of the parameters. More importantly, this study analyzes the limits of knowledge distillation as we distill BERT's knowledge all the way down to linear models, a relevant baseline for the task. We report substantial improvements in effectiveness for even the simplest models, as they capture the knowledge learnt by BERT.
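For readers unfamiliar with the compression method the abstract refers to, below is a minimal PyTorch sketch of the standard soft-target knowledge distillation objective (in the style of Hinton et al.), in which a small student learns from a large teacher's temperature-smoothed output distribution. The temperature `T` and mixing weight `alpha` are illustrative assumptions, not values from the paper, and the authors' exact objective may differ from this generic formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,        # assumed temperature, not from the paper
                      alpha: float = 0.5):   # assumed mixing weight, not from the paper
    """Combine soft-target (teacher) and hard-target (label) losses."""
    # Soft-target term: KL divergence between the temperature-scaled
    # student and teacher distributions, rescaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

In this setup the teacher (e.g., a fine-tuned BERT) is frozen while the student (e.g., a single-layer LSTM or a linear model, as in the baselines the abstract describes) is trained on the combined loss.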