Structured Pruning Learns Compact and Accurate Models
Anonymous
Abstract
The growing size of neural language models has led to increased attention to model compression. Pruning methods start from a large model and gradually remove weights---they can significantly reduce model size but rarely yield substantial runtime speedups. Distillation methods, on the other hand, start from a shallower, compact model and can obtain large speedups---however, they are costly to train on large amounts of unlabeled data. In this work, we show that structured pruning can match its distillation counterparts in both speed (>10× speedups) and accuracy (>92\%), producing highly compact and efficient subnetworks. Unlike distillation, our task-specific pruning approach neither needs to pre-specify the model architecture nor relies on unlabeled data. Our solution is to jointly prune layers and sub-modules such as attention heads and hidden units in Transformer models through $\ell_0$ regularization, while ensuring that the resulting model remains parallelizable. We also propose a layerwise distillation approach to further guide pruning. Finally, the pruned structures reveal interesting patterns---for example, more than 70\% of feed-forward layers and 50\% of self-attention layers can easily be pruned, while the first and last 1-2 layers tend to remain in highly compressed models.
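The $\ell_0$-regularized pruning mentioned above is commonly made differentiable with the hard-concrete relaxation: each prunable unit (e.g., an attention head) gets a stochastic gate that can reach exactly zero, and the expected number of open gates serves as the sparsity penalty. The sketch below illustrates this standard relaxation; the parameter names and constants are illustrative assumptions, not taken from the paper's released code.

```python
# Minimal sketch of hard-concrete gates for L0 regularization.
# GAMMA/ZETA stretch the interval so clipping produces exact zeros/ones;
# BETA is the concrete-distribution temperature. All values illustrative.
import numpy as np

GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_mask(log_alpha, rng):
    """Draw a differentiable, approximately binary mask z in [0, 1] per unit."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
    s_bar = s * (ZETA - GAMMA) + GAMMA   # stretch to (GAMMA, ZETA)
    return np.clip(s_bar, 0.0, 1.0)      # hard clip -> exact zeros possible

def expected_l0(log_alpha):
    """Expected number of nonzero gates: the differentiable sparsity penalty."""
    return sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)).sum()

rng = np.random.default_rng(0)
log_alpha = np.zeros(12)            # e.g., one learnable gate per attention head
z = sample_mask(log_alpha, rng)     # masks multiply the corresponding head outputs
penalty = expected_l0(log_alpha)    # added (scaled) to the training loss
```

During training, gates whose `log_alpha` drifts strongly negative collapse to zero and the corresponding heads, hidden units, or whole layers can be removed, which is what makes the resulting subnetwork genuinely faster rather than merely sparse.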