SOTAVerified

BP-Transformer: Modelling Long-Range Context via Binary Partitioning

2019-11-11 · Code Available

Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang


Abstract

The Transformer model is widely successful on many natural language processing tasks. However, the quadratic complexity of self-attention limits its application to long text. In this paper, adopting a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning (BP), we propose the BP-Transformer (BPT for short). BPT yields O(k·n·log(n/k)) connections, where k is a hyperparameter controlling the density of attention, and strikes a good balance between computational complexity and model capacity. A series of experiments on text classification, machine translation, and language modeling shows that BPT outperforms previous self-attention models on long text. Our code, hyperparameters, and CUDA kernels for sparse attention are available in PyTorch.
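The abstract's core idea can be illustrated with a short sketch: the sequence is recursively split in half to form a binary tree of multi-scale spans, and each token attends to a handful of spans per scale (fine-grained nearby, coarse-grained far away), giving O(k·n·log(n/k)) total connections. The function names and the connection-count estimate below are illustrative assumptions, not taken from the authors' code.

```python
import math

def build_spans(lo, hi, spans=None):
    """Recursively partition [lo, hi) into a binary tree of spans.

    A full binary partition of n leaves produces 2n - 1 spans in total.
    """
    if spans is None:
        spans = []
    spans.append((lo, hi))
    if hi - lo > 1:
        mid = (lo + hi) // 2
        build_spans(lo, mid, spans)
        build_spans(mid, hi, spans)
    return spans

def connections_per_token(n, k):
    """Rough per-token attention budget: about k spans at each of the
    ~log2(n/k) scales, so the total over n tokens is O(k * n * log(n/k))."""
    levels = max(1, math.ceil(math.log2(max(n // k, 2))))
    return k * levels

spans = build_spans(0, 8)
print(len(spans))                      # 2n - 1 spans for n = 8 leaves
print(connections_per_token(1024, 4))  # sub-quadratic, unlike n attention targets per token
```

For n = 1024 and k = 4 this gives a few dozen attention targets per token, versus 1024 under dense self-attention, which is the computation/capacity trade-off the abstract refers to.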

Benchmark Results

Dataset   Model                        Metric                   Claimed  Verified  Status
enwik8    BP-Transformer (12 layers)   Bit per Character (BPC)  1.02     —         Unverified
Text8     BP-Transformer (12 layers)   Bit per Character (BPC)  1.11     —         Unverified

Reproductions