| Training Compute-Optimal Large Language Models | Mar 29, 2022 | Anachronisms, Analogical Similarity | Code Available | 6 |
| Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale | Jul 17, 2024 | GPU, LAMBADA | Code Available | 2 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | Dec 8, 2021 | Abstract Algebra, Anachronisms | Code Available | 2 |
| Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Sep 17, 2019 | GPU, LAMBADA | Code Available | 2 |
| Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM | Nov 3, 2024 | LAMBADA, Text Generation | Code Available | 1 |
| Beyond Autoregression: Fast LLMs via Self-Distillation Through Time | Oct 28, 2024 | Automated Theorem Proving, Code Generation | Code Available | 1 |
| Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences | Apr 6, 2020 | LAMBADA, Language Modelling | Code Available | 1 |
| The LAMBADA dataset: Word prediction requiring a broad discourse context | Jun 20, 2016 | LAMBADA, Sentence | Code Available | 1 |
| Matryoshka Model Learning for Improved Elastic Student Models | May 29, 2025 | LAMBADA, Math | Unverified | 0 |
| AdaGC: Improving Training Stability for Large Language Model Pretraining | Feb 16, 2025 | LAMBADA, Language Modeling | Unverified | 0 |