GLU Variants Improve Transformer
2020-02-12
Noam Shazeer
Code Available
- github.com/BlinkDL/RWKV-LM (pytorch, ★ 14,428)
- github.com/lucidrains/reformer-pytorch (pytorch, ★ 2,192)
- github.com/answerdotai/modernbert (pytorch, ★ 1,647)
- github.com/lucidrains/performer-pytorch (pytorch, ★ 1,173)
- github.com/lucidrains/nuwa-pytorch (pytorch, ★ 549)
- github.com/lucidrains/routing-transformer (pytorch, ★ 300)
- github.com/lucidrains/progen (jax, ★ 113)
- github.com/lucidrains/progen-jax (jax, ★ 113)
- github.com/nlpodyssey/rwkv (★ 42)
- github.com/Rishit-dagli/GLU (tf, ★ 20)
Abstract
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
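The feed-forward variants the abstract describes can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code: the function `glu_variant_ffn`, the toy dimensions, and the weight names `W`, `V`, `W2` are all assumptions chosen for clarity. A GLU-style FFN computes the component-wise product of two linear projections of the input, passing one of them through an activation, then applies a final output projection. Swapping the activation yields the variants: sigmoid gives the original GLU, and `z * sigmoid(z)` (SiLU/Swish) gives SwiGLU.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_variant_ffn(x, W, V, W2, activation=sigmoid):
    """GLU-variant feed-forward sublayer (illustrative sketch):
    FFN_GLU(x) = (activation(x W) * (x V)) W2.
    Biases are omitted for brevity."""
    return (activation(x @ W) * (x @ V)) @ W2

# Toy dimensions (assumed for illustration): model dim 4, hidden dim 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))    # two token positions
W = rng.normal(size=(4, 8))    # gate projection
V = rng.normal(size=(4, 8))    # value projection
W2 = rng.normal(size=(8, 4))   # output projection

y_glu = glu_variant_ffn(x, W, V, W2)  # original GLU (sigmoid gate)
y_swiglu = glu_variant_ffn(x, W, V, W2,
                           activation=lambda z: z * sigmoid(z))  # SwiGLU
print(y_glu.shape, y_swiglu.shape)  # both (2, 4): output matches model dim
```

Note that, compared with the standard two-matrix FFN, a GLU variant has three weight matrices, so implementations often shrink the hidden dimension to keep the parameter count comparable.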