| Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Dec 15, 2023 | In-Context LearningLanguage Model Evaluation | CodeCode Available | 1 | 5 |
| Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula | Aug 8, 2024 | GSM8KLanguage Modeling | CodeCode Available | 1 | 5 |
| CREAM: Consistency Regularized Self-Rewarding Language Models | Oct 16, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Topics, Authors, and Institutions in Large Language Model Research: Trends from 17K arXiv Papers | Jul 20, 2023 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| A Realistic Threat Model for Large Language Model Jailbreaks | Oct 21, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| Evaluating Morphological Alignment of Tokenizers in 70 Languages | Jul 8, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| CAT-LM: Training Language Models on Aligned Code And Tests | Oct 2, 2023 | Code GenerationLanguage Modeling | CodeCode Available | 1 | 5 |
| Evaluating Retrieval Quality in Retrieval-Augmented Generation | Apr 21, 2024 | GPULanguage Modeling | CodeCode Available | 1 | 5 |
| CPM: A Large-scale Generative Chinese Pre-trained Language Model | Dec 1, 2020 | Cloze TestLanguage Modeling | CodeCode Available | 1 | 5 |
| 4-bit Shampoo for Memory-Efficient Network Training | May 28, 2024 | image-classificationImage Classification | CodeCode Available | 1 | 5 |