ST-MoE: Designing Stable and Transferable Sparse Expert Models
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus
Code
- github.com/tensorflow/mesh (official, in paper; TensorFlow; ★ 1,625)
- github.com/xuefuzhao/openmoe (PyTorch; ★ 1,667)
- github.com/yikangshen/megablocks (PyTorch; ★ 20)
Abstract
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
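A key ingredient behind the training stability described in the abstract is the router z-loss the paper introduces, which penalizes large router logits before the softmax. Below is a minimal NumPy sketch of that loss; the function name and array shapes are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def router_z_loss(logits: np.ndarray) -> float:
    """Router z-loss from ST-MoE: mean squared log-partition of the
    router logits. Encourages the logits feeding the routing softmax
    to stay small, which stabilizes training in bfloat16.

    logits: array of shape (num_tokens, num_experts).
    """
    # log-sum-exp over the expert dimension, one value per token
    log_z = np.log(np.sum(np.exp(logits), axis=-1))
    # squared log-partition, averaged over all routed tokens
    return float(np.mean(log_z ** 2))
```

In training, this term is added to the cross-entropy and load-balancing losses with a small coefficient (the paper uses 1e-3), so it damps logit growth without changing which expert wins the routing decision.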
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| arc_challenge | ST-MoE-L 4.1B (fine-tuned) | Accuracy | 56.9 | — | Unverified |
| arc_challenge | ST-MoE-32B 269B (fine-tuned) | Accuracy | 86.5 | — | Unverified |
| arc_easy | ST-MoE-L 4.1B (fine-tuned) | Accuracy | 75.4 | — | Unverified |
| arc_easy | ST-MoE-32B 269B (fine-tuned) | Accuracy | 95.2 | — | Unverified |
| ReCoRD | ST-MoE-L 4.1B (fine-tuned) | EM | 88.9 | — | Unverified |
| ReCoRD | ST-MoE-32B 269B (fine-tuned) | EM | 95.1 | — | Unverified |
| WinoGrande | ST-MoE-L 4.1B (fine-tuned) | Accuracy | 81.7 | — | Unverified |
| WinoGrande | ST-MoE-32B 269B (fine-tuned) | Accuracy | 96.1 | — | Unverified |