| Paper | Date | Tasks | Code | | |
| --- | --- | --- | --- | --- | --- |
| Training Compute-Optimal Large Language Models | Mar 29, 2022 | Anachronisms, Analogical Similarity | Code Available | 6 | 5 |
| Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale | Jul 17, 2024 | GPU, LAMBADA | Code Available | 2 | 5 |
| Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Sep 17, 2019 | GPU, LAMBADA | Code Available | 2 | 5 |
| Scaling Language Models: Methods, Analysis & Insights from Training Gopher | Dec 8, 2021 | Abstract Algebra, Anachronisms | Code Available | 2 | 5 |
| Beyond Autoregression: Fast LLMs via Self-Distillation Through Time | Oct 28, 2024 | Automated Theorem Proving, Code Generation | Code Available | 1 | 5 |
| Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM | Nov 3, 2024 | LAMBADA, Text Generation | Code Available | 1 | 5 |
| Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences | Apr 6, 2020 | LAMBADA, Language Modelling | Code Available | 1 | 5 |
| The LAMBADA dataset: Word prediction requiring a broad discourse context | Jun 20, 2016 | LAMBADA, Sentence | Code Available | 1 | 5 |
| The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models | Aug 13, 2021 | LAMBADA, Text Generation | Code Available | 0 | 5 |
| Inconsistencies in Masked Language Models | Dec 30, 2022 | LAMBADA, MMLU | Code Available | 0 | 5 |