
XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

2024-02-27

Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, Zenglin Xu


Abstract

Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency, since a significant number of parameters are unnecessarily involved in computations by multiplying values by zero or low activation values. To address this issue, we present XMoE, a novel MoE designed to enhance both the efficacy and efficiency of sparse MoE models. XMoE leverages small experts and a threshold-based router to enable tokens to selectively engage only essential parameters. Our extensive experiments on language modeling and machine translation tasks demonstrate that XMoE can enhance model performance while decreasing the computation load at MoE layers by over 50% without sacrificing performance. Furthermore, we present the versatility of XMoE by applying it to dense models, enabling sparse computation during inference. We provide a comprehensive analysis and make our code available at https://github.com/ysngki/XMoE.
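
To illustrate the idea of threshold-based expert selection described in the abstract, here is a minimal sketch in PyTorch. This is not the authors' implementation (see https://github.com/ysngki/XMoE for the official code); the class name `ThresholdRouter`, the fixed `threshold` value, and the softmax gating are assumptions made for illustration only.

```python
# Minimal sketch of a threshold-based MoE router (illustrative only,
# not the XMoE authors' code). Assumption: gate scores are a softmax
# over experts, and a token is routed only to experts whose score
# exceeds a fixed threshold, so most experts are skipped per token.
import torch
import torch.nn as nn


class ThresholdRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, threshold: float = 0.1):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.threshold = threshold

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)   # (num_tokens, num_experts)
        mask = scores >= self.threshold                 # keep only "essential" experts
        # Renormalize surviving scores so each token's routing weights sum to 1.
        weights = scores * mask
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-9)
        return weights, mask


# Usage: tokens whose mask rows are mostly False skip most experts,
# which is the source of the reduced computation at MoE layers.
router = ThresholdRouter(d_model=512, num_experts=16, threshold=0.1)
tokens = torch.randn(8, 512)
weights, mask = router(tokens)
```

A real implementation would also need a fallback (e.g., keeping the top-scoring expert when no score clears the threshold) and load-balancing considerations, which this sketch omits for brevity.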
