Efficient Language Modeling with Sparse all-MLP

2022-03-14

Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li


Abstract

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2× improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
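The claim that sparse activation raises capacity "while keeping the compute constant" can be illustrated with a small sketch. The code below is not the paper's method: the abstract does not specify its two routing strategies, so this uses a generic learned top-1 gate over tokens only (in the spirit of Switch-style routing) and does not cover routing in the feature dimension. Every name, dimension, and initialization here is a hypothetical choice made for illustration.

```python
import random

def make_expert(d_model, d_hidden, rng):
    """One feed-forward expert: input and output weight matrices (lists of rows)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out

def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n x m); returns length m."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def expert_forward(x, expert):
    """Apply one expert: linear, ReLU, linear."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)

def top1_route(x, gate):
    """Assumed routing rule: pick the expert with the highest gate score."""
    scores = matvec(x, gate)
    return max(range(len(scores)), key=scores.__getitem__)

def sparse_moe_layer(tokens, experts, gate):
    """Each token is processed by exactly one expert, chosen by the gate."""
    outputs, assignments = [], []
    for x in tokens:
        e = top1_route(x, gate)
        outputs.append(expert_forward(x, experts[e]))
        assignments.append(e)
    return outputs, assignments

def param_count(experts):
    """Total number of weights across the given experts."""
    return sum(len(w) * len(w[0]) for expert in experts for w in expert)

# Hypothetical toy sizes, chosen only to keep the example fast.
rng = random.Random(0)
d_model, d_hidden, n_experts = 8, 16, 4
experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
gate = [[rng.gauss(0, 1) for _ in range(n_experts)] for _ in range(d_model)]
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(6)]

outputs, assignments = sparse_moe_layer(tokens, experts, gate)
total_params = param_count(experts)                 # grows with n_experts
active_params_per_token = param_count(experts[:1])  # independent of n_experts
print(assignments, total_params, active_params_per_token)
```

In this toy setup the layer stores the weights of all four experts, but each token only multiplies through one expert's weights, so adding experts increases parameter count without increasing per-token expert compute (the gate itself adds a small cost that scales with the number of experts).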

Tasks

Common Sense Reasoning, In-Context Learning, Language Modeling, Mixture-of-Experts, Question Answering, Sentence Completion, Zero-Shot Learning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ReCoRD	Switch Transformer 9B	EM	79.9	—	Unverified
ReCoRD	Base Layers 10B (0-shot)	EM	60.7	—	Unverified
ReCoRD	HASH Layers 10B (0-shot)	EM	67.2	—	Unverified
ReCoRD	GShard 9B	EM	72.4	—	Unverified
ReCoRD	sMLP – deterministic 9.4B (0-shot)	EM	73.4	—	Unverified
WinoGrande	Base Layers 10B (0-shot)	Accuracy	51	—	Unverified
WinoGrande	GShard 9B (0-shot)	Accuracy	51.1	—	Unverified
WinoGrande	HASH Layers 10B (0-shot)	Accuracy	51.7	—	Unverified
WinoGrande	Switch Transformer 9B (0-shot)	Accuracy	53.4	—	Unverified
WinoGrande	sMLP – deterministic 9.4B (0-shot)	Accuracy	54.3	—	Unverified
