Q8BERT: Quantized 8Bit BERT
Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat
Code
- github.com/intellabs/model-compression-research-package (official, PyTorch, ★ 0)
- github.com/NervanaSystems/nlp-architect/blob/master/nlp_architect/models/transformers/quantized_bert.py (official, TF, ★ 0)
- github.com/huggingface/block_movement_pruning (PyTorch, ★ 83)
- github.com/iabd/QuantizedNMT (PyTorch, ★ 0)
- github.com/mindspore-ai/models/tree/master/official/nlp/q8bert (MindSpore, ★ 0)
Abstract
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters, and the emergence of even larger and more accurate models such as GPT-2 and Megatron suggests a trend toward ever-larger pre-trained Transformer models. Deploying these large models in production environments is a complex task requiring a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4× with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference speed if it is optimized for hardware supporting 8-bit integer operations.
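The core operation behind the quantization-aware training the abstract describes is symmetric linear "fake" quantization: in the forward pass, weights and activations are rounded to 8-bit integer levels and immediately dequantized, so the network trains against the rounding error it will see at inference time (gradients typically pass through unchanged via a straight-through estimator). The sketch below is a minimal NumPy illustration of that quantize-dequantize step; the function name and per-tensor scaling choice are this sketch's assumptions, not the paper's exact implementation.

```python
import numpy as np

def symmetric_fake_quantize(x, num_bits=8):
    """Quantize-dequantize a tensor with symmetric linear quantization.

    A minimal sketch of the fake-quantization forward pass used in
    quantization-aware training: values are scaled to the signed integer
    range, rounded and clipped, then mapped back to floats. The function
    name and per-tensor scale are illustrative assumptions.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8-bit
    scale = qmax / np.max(np.abs(x))          # per-tensor scale factor
    q = np.clip(np.round(x * scale), -qmax, qmax)  # simulated int8 codes
    return q / scale                          # dequantized float values

# Example: a toy weight vector rounded onto the 8-bit grid
w = np.array([-1.2, 0.003, 0.74, 1.2])
w_q = symmetric_fake_quantize(w)
```

Because the scale is derived from the tensor's maximum absolute value, the extreme values are represented exactly, and every element lands within half a quantization step of its original value.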
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CoLA | Q8BERT (Zafrir et al., 2019) | Accuracy | 65 | — | Unverified |