GLU Variants Improve Transformer
2020-02-12
Noam Shazeer
Code Available
- github.com/BlinkDL/RWKV-LM (pytorch, ★ 14,428)
- github.com/lucidrains/reformer-pytorch (pytorch, ★ 2,192)
- github.com/answerdotai/modernbert (pytorch, ★ 1,647)
- github.com/lucidrains/performer-pytorch (pytorch, ★ 1,173)
- github.com/lucidrains/nuwa-pytorch (pytorch, ★ 549)
- github.com/lucidrains/routing-transformer (pytorch, ★ 300)
- github.com/lucidrains/progen (jax, ★ 113)
- github.com/lucidrains/progen-jax (jax, ★ 113)
- github.com/nlpodyssey/rwkv (★ 42)
- github.com/Rishit-dagli/GLU (tf, ★ 20)
Abstract
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
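The feed-forward variants the abstract describes can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code: the function `glu_variant_ffn`, the toy dimensions, and the weight names `W`, `V`, `W2` are all assumptions chosen for clarity. A GLU-style FFN computes the component-wise product of two linear projections of the input, passing one of them through an activation, then applies a final output projection. Swapping the activation yields the variants: sigmoid gives the original GLU, and `z * sigmoid(z)` (SiLU/Swish) gives SwiGLU.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_variant_ffn(x, W, V, W2, activation=sigmoid):
    """GLU-variant feed-forward sublayer (illustrative sketch):
    FFN_GLU(x) = (activation(x W) * (x V)) W2.
    Biases are omitted for brevity."""
    return (activation(x @ W) * (x @ V)) @ W2

# Toy dimensions (assumed for illustration): model dim 4, hidden dim 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))    # two token positions
W = rng.normal(size=(4, 8))    # gate projection
V = rng.normal(size=(4, 8))    # value projection
W2 = rng.normal(size=(8, 4))   # output projection

y_glu = glu_variant_ffn(x, W, V, W2)  # original GLU (sigmoid gate)
y_swiglu = glu_variant_ffn(x, W, V, W2,
                           activation=lambda z: z * sigmoid(z))  # SwiGLU
print(y_glu.shape, y_swiglu.shape)  # both (2, 4): output matches model dim
```

Note that, compared with the standard two-matrix FFN, a GLU variant has three weight matrices, so implementations often shrink the hidden dimension to keep the parameter count comparable.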