Compressive Transformers for Long-Range Sequence Modelling
2019-11-13 · ICLR 2020 · Code Available
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap
Abstract
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
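The mechanism the abstract refers to can be sketched briefly: as in Transformer-XL, each segment's activations are cached as memory, but instead of being discarded once the memory is full, the oldest activations are compressed by a rate c into a secondary compressed memory that attention can still attend over. Below is a minimal, illustrative sketch of that memory-update step in PyTorch; the function name, the length defaults, and the use of mean pooling (one of several compression functions discussed in the paper) are assumptions for illustration, not the authors' released implementation.

```python
import torch

def update_compressive_memory(memory, comp_memory, new_hiddens,
                              mem_len=512, comp_mem_len=512, c=3):
    """Illustrative single memory-update step (not the authors' code).

    memory:      [mem_len, d]       uncompressed past activations
    comp_memory: [comp_mem_len, d]  compressed older activations
    new_hiddens: [seq_len, d]       activations from the current segment
    c:           compression rate   (old slots per compressed slot)
    """
    # Append the current segment's activations to the uncompressed memory.
    memory = torch.cat([memory, new_hiddens], dim=0)

    # Activations that no longer fit are compressed rather than discarded.
    overflow = memory.shape[0] - mem_len
    if overflow > 0:
        old, memory = memory[:overflow], memory[overflow:]
        # Pad so the overflow length divides the compression rate
        # (in practice the segment length is usually a multiple of c).
        pad = (-old.shape[0]) % c
        if pad:
            old = torch.cat([old, old.new_zeros(pad, old.shape[1])], dim=0)
        # Mean-pool groups of c old slots into one compressed slot.
        compressed = old.view(-1, c, old.shape[1]).mean(dim=1)
        comp_memory = torch.cat([comp_memory, compressed], dim=0)[-comp_mem_len:]

    # Memories are stored without gradients, as in Transformer-XL-style caching.
    return memory.detach(), comp_memory.detach()

if __name__ == "__main__":
    d = 16
    mem, cmem = torch.zeros(0, d), torch.zeros(0, d)
    for _ in range(4):
        segment = torch.randn(128, d)  # hidden states of one segment
        mem, cmem = update_compressive_memory(mem, cmem, segment,
                                              mem_len=256, comp_mem_len=256, c=3)
    print(mem.shape, cmem.shape)
```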
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| enwik8 | Compressive Transformer (24 layers) | Bits per Character (BPC) | 0.97 | — | Unverified |
| Hutter Prize | Compressive Transformer | Bits per Character (BPC) | 0.97 | — | Unverified |
| WikiText-103 | Compressive Transformer (18L, M=1024) | Test perplexity | 17.1 | — | Unverified |