Q8BERT: Quantized 8Bit BERT
Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat
Code
- github.com/intellabs/model-compression-research-package (official, PyTorch, ★ 0)
- github.com/NervanaSystems/nlp-architect/blob/master/nlp_architect/models/transformers/quantized_bert.py (official, TF, ★ 0)
- github.com/huggingface/block_movement_pruning (PyTorch, ★ 83)
- github.com/iabd/QuantizedNMT (PyTorch, ★ 0)
- github.com/mindspore-ai/models/tree/master/official/nlp/q8bert (MindSpore, ★ 0)
Abstract
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters, and the emergence of even larger and more accurate models such as GPT-2 and Megatron suggests a trend toward ever-larger pre-trained Transformer models. Deploying these large models in production environments is a complex task requiring a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4× with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference speed if it is optimized for hardware supporting 8-bit integer operations.
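The core operation behind the quantization-aware training the abstract describes is symmetric linear "fake" quantization: in the forward pass, weights and activations are rounded to 8-bit integer levels and immediately dequantized, so the network trains against the rounding error it will see at inference time (gradients typically pass through unchanged via a straight-through estimator). The sketch below is a minimal NumPy illustration of that quantize-dequantize step; the function name and per-tensor scaling choice are this sketch's assumptions, not the paper's exact implementation.

```python
import numpy as np

def symmetric_fake_quantize(x, num_bits=8):
    """Quantize-dequantize a tensor with symmetric linear quantization.

    A minimal sketch of the fake-quantization forward pass used in
    quantization-aware training: values are scaled to the signed integer
    range, rounded and clipped, then mapped back to floats. The function
    name and per-tensor scale are illustrative assumptions.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8-bit
    scale = qmax / np.max(np.abs(x))          # per-tensor scale factor
    q = np.clip(np.round(x * scale), -qmax, qmax)  # simulated int8 codes
    return q / scale                          # dequantized float values

# Example: a toy weight vector rounded onto the 8-bit grid
w = np.array([-1.2, 0.003, 0.74, 1.2])
w_q = symmetric_fake_quantize(w)
```

Because the scale is derived from the tensor's maximum absolute value, the extreme values are represented exactly, and every element lands within half a quantization step of its original value.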
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CoLA | Q8BERT (Zafrir et al., 2019) | Accuracy | 65 | — | Unverified |