| Paper | Date | Tasks | Code |
|---|---|---|---|
| DeepSeek-V3 Technical Report | Dec 27, 2024 | GPU, Language Modeling | Code Available |
| Qwen2 Technical Report | Jul 15, 2024 | Arithmetic Reasoning, GSM8K | Code Available |
| Qwen2.5 Technical Report | Dec 19, 2024 | Common Sense Reasoning | Code Available |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Dec 13, 2024 | Chart Understanding, Mixture-of-Experts | Code Available |
| A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications | Mar 10, 2025 | Continual Learning, Meta-Learning | Code Available |
| DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence | Jun 17, 2024 | 16k, Language Modeling | Code Available |
| Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters | Jun 10, 2024 | Mixture-of-Experts | Code Available |
| DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | May 7, 2024 | Language Modeling | Code Available |
| MoBA: Mixture of Block Attention for Long-Context LLMs | Feb 18, 2025 | Mixture-of-Experts | Code Available |
| MiniMax-01: Scaling Foundation Models with Lightning Attention | Jan 14, 2025 | Mixture-of-Experts | Code Available |
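Most papers in this table build on Mixture-of-Experts layers. As a minimal sketch of the general technique only (a top-k gated MoE feed-forward layer; all hyperparameters and the dense-loop routing here are illustrative assumptions, not the implementation of any listed paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k gated Mixture-of-Experts feed-forward layer.

    Illustrative sketch: sizes and routing strategy are assumptions,
    not the design of any specific paper listed above.
    """
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: scores each token against every expert.
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.gate(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Dense loop for clarity; real systems use batched expert dispatch.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e         # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 16 tokens; only top_k of n_experts activate per token.
layer = TopKMoE()
y = layer(torch.randn(16, 512))
print(y.shape)  # torch.Size([16, 512])
```

The point of the pattern, and the reason it recurs across these papers, is that per-token compute scales with `top_k` while parameter count scales with `n_experts`, so total capacity grows with only a small increase in activated parameters.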