Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
Code
- github.com/google/trax/tree/master/trax/models/reformer (official, JAX) ★ 0
- github.com/huggingface/transformers (PyTorch) ★ 158,292
- github.com/lucidrains/reformer-pytorch (PyTorch) ★ 2,192
- github.com/Rick-McCoy/Reformer-pytorch (PyTorch) ★ 86
- github.com/lucashueda/long_sentence_transformer (PyTorch) ★ 4
- github.com/junnyu/paddle_reformer (JAX) ★ 4
- github.com/sliao-mi-luku/NLP-Chatbot-Reformer-Trax (PyTorch) ★ 2
- github.com/sliao-mi-luku/Chatbot-Reformer ★ 2
- github.com/yangyucheng000/University/tree/main/model-3/reformer (MindSpore) ★ 0
- github.com/MindCode-4/code-3/tree/main/efficientformer (MindSpore) ★ 0
Abstract
Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
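The locality-sensitive hashing mentioned in the abstract is angular LSH via random projections: the paper hashes a vector x to argmax([xR; -xR]) for a random matrix R, so that nearby queries/keys land in the same bucket and attention only needs to be computed within buckets. A minimal NumPy sketch of this hashing step (function name and bucket count are illustrative, not from the paper's code):

```python
import numpy as np

def lsh_bucket(vecs, n_buckets, seed=0):
    """Angular LSH as in the Reformer: hash(x) = argmax([xR; -xR])
    for a random projection matrix R with n_buckets/2 columns."""
    rng = np.random.default_rng(seed)
    d = vecs.shape[-1]
    # Random projection; concatenating proj and -proj yields n_buckets logits.
    R = rng.normal(size=(d, n_buckets // 2))
    proj = vecs @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

# Two nearly parallel vectors hash to the same bucket; an opposite
# vector is guaranteed to hash elsewhere.
q = np.array([[1.0, 0.0], [0.99, 0.05], [-1.0, 0.0]])
buckets = lsh_bucket(q, n_buckets=8)
```

After bucketing, the Reformer sorts positions by bucket, chunks them, and attends only within (and to adjacent) chunks, which is what brings the cost from O(L^2) down to O(L log L).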
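The second technique, reversible residual layers (in the style of RevNets), lets the backward pass recompute each layer's inputs from its outputs, so activations need not be stored per layer. A minimal sketch, with F and G standing in for the attention and feed-forward sublayers (the function names here are illustrative):

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """Reversible residual block: the pair (y1, y2) fully determines (x1, x2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Recover the block's inputs exactly from its outputs."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# F ~ attention sublayer, G ~ feed-forward sublayer (toy stand-ins here).
F = lambda x: np.tanh(x)
G = lambda x: 2.0 * x
x1, x2 = np.ones(3), np.arange(3.0)
y1, y2 = rev_forward(x1, x2, F, G)
x1r, x2r = rev_inverse(y1, y2, F, G)
```

Because inputs are recomputed rather than cached, activation memory stays constant in the number of layers N instead of growing linearly.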
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| ImageNet 64x64 | Reformer (12 layers) | Bits per dim | 3.71 | — | Unverified |
| ImageNet 64x64 | Reformer (6 layers) | Bits per dim | 3.74 | — | Unverified |