Pay Attention when Required
2020-09-09 · Code Available
Swetha Mandava, Szymon Migacz, Alex Fit-Florea
Abstract
Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explore trade-offs and orderings of these blocks to improve upon the current Transformer architecture, and we propose the PAR Transformer. It needs 35% less compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, while retaining the perplexity on the WikiText-103 language modelling benchmark. We further validate our results on the text8 and enwiki8 datasets, as well as on the BERT model.
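To make the block replacement concrete, the following is a minimal sketch that encodes a model as a string of `s` (self-attention) and `f` (feed-forward) blocks and swaps a fraction of the `s` blocks for `f` blocks. The 16-layer baseline, the function names, and the rule of keeping attention in the earliest positions are assumptions chosen for illustration; the abstract does not state the actual block ordering, so this should not be read as the authors' layout.

```python
# Illustrative only: the layer count and the placement rule below are
# assumptions, not the block ordering reported in the paper.

def interleaved_pattern(n_layers: int) -> str:
    """Baseline ordering: each layer is self-attention ('s') then feed-forward ('f')."""
    return "sf" * n_layers


def replace_attention(pattern: str, fraction: float) -> str:
    """Swap the last `fraction` of 's' blocks for 'f' blocks (hypothetical rule)."""
    positions = [i for i, block in enumerate(pattern) if block == "s"]
    n_replace = round(len(positions) * fraction)
    blocks = list(pattern)
    if n_replace:
        for i in positions[-n_replace:]:
            blocks[i] = "f"
    return "".join(blocks)


baseline = interleaved_pattern(16)            # assumed 16-layer baseline
par_like = replace_attention(baseline, 0.63)  # ~63% of attention blocks replaced

print(baseline)
print(par_like)
print(par_like.count("s"), "of", baseline.count("s"), "self-attention blocks remain")
```

The total number of blocks is unchanged; only the mix shifts toward the cheaper feed-forward blocks, which is where the reported compute saving would come from.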
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| enwiki8 | PAR Transformer 24B | Bits per Character (BPC) | 1.11 | — | Unverified |
| Text8 | PAR Transformer 24B | Bits per Character (BPC) | 1.18 | — | Unverified |
| WikiText-103 | PAR Transformer Large | Test perplexity | 18.4 | — | Unverified |
| WikiText-103 | PAR Transformer Base | Test perplexity | 22.7 | — | Unverified |