Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
Code
- github.com/google/trax/tree/master/trax/models/reformer (official, JAX) ★ 0
- github.com/huggingface/transformers (PyTorch) ★ 158,292
- github.com/lucidrains/reformer-pytorch (PyTorch) ★ 2,192
- github.com/Rick-McCoy/Reformer-pytorch (PyTorch) ★ 86
- github.com/lucashueda/long_sentence_transformer (PyTorch) ★ 4
- github.com/junnyu/paddle_reformer (JAX) ★ 4
- github.com/sliao-mi-luku/NLP-Chatbot-Reformer-Trax (PyTorch) ★ 2
- github.com/sliao-mi-luku/Chatbot-Reformer ★ 2
- github.com/yangyucheng000/University/tree/main/model-3/reformer (MindSpore) ★ 0
- github.com/MindCode-4/code-3/tree/main/efficientformer (MindSpore) ★ 0
Abstract
Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
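The locality-sensitive hashing mentioned in the abstract is angular LSH via random projections: the paper hashes a vector x to argmax([xR; -xR]) for a random matrix R, so that nearby queries/keys land in the same bucket and attention only needs to be computed within buckets. A minimal NumPy sketch of this hashing step (function name and bucket count are illustrative, not from the paper's code):

```python
import numpy as np

def lsh_bucket(vecs, n_buckets, seed=0):
    """Angular LSH as in the Reformer: hash(x) = argmax([xR; -xR])
    for a random projection matrix R with n_buckets/2 columns."""
    rng = np.random.default_rng(seed)
    d = vecs.shape[-1]
    # Random projection; concatenating proj and -proj yields n_buckets logits.
    R = rng.normal(size=(d, n_buckets // 2))
    proj = vecs @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

# Two nearly parallel vectors hash to the same bucket; an opposite
# vector is guaranteed to hash elsewhere.
q = np.array([[1.0, 0.0], [0.99, 0.05], [-1.0, 0.0]])
buckets = lsh_bucket(q, n_buckets=8)
```

After bucketing, the Reformer sorts positions by bucket, chunks them, and attends only within (and to adjacent) chunks, which is what brings the cost from O(L^2) down to O(L log L).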
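The second technique, reversible residual layers (in the style of RevNets), lets the backward pass recompute each layer's inputs from its outputs, so activations need not be stored per layer. A minimal sketch, with F and G standing in for the attention and feed-forward sublayers (the function names here are illustrative):

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """Reversible residual block: the pair (y1, y2) fully determines (x1, x2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Recover the block's inputs exactly from its outputs."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# F ~ attention sublayer, G ~ feed-forward sublayer (toy stand-ins here).
F = lambda x: np.tanh(x)
G = lambda x: 2.0 * x
x1, x2 = np.ones(3), np.arange(3.0)
y1, y2 = rev_forward(x1, x2, F, G)
x1r, x2r = rev_inverse(y1, y2, F, G)
```

Because inputs are recomputed rather than cached, activation memory stays constant in the number of layers N instead of growing linearly.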
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| ImageNet 64x64 | Reformer (12 layers) | Bits per dim | 3.71 | — | Unverified |
| ImageNet 64x64 | Reformer (6 layers) | Bits per dim | 3.74 | — | Unverified |