| Reactor Mk.1 performances: MMLU, HumanEval and BBH test results | Jun 15, 2024 | Benchmarking, HumanEval | Unverified | 0 |
| GEB-1.3B: Open Lightweight Large Language Model | Jun 14, 2024 | CPU, Language Modeling | Unverified | 0 |
| Quantifying Variance in Evaluation Benchmarks | Jun 14, 2024 | MMLU | Unverified | 0 |
| An Empirical Study of Mamba-based Language Models | Jun 12, 2024 | 16k, In-Context Learning | Unverified | 0 |
| Are We Done with MMLU? | Jun 6, 2024 | MMLU, Virology | Code Available | 3 |
| Does your data spark joy? Performance gains from domain upsampling at the end of training | Jun 5, 2024 | GSM8K, HumanEval | Unverified | 0 |
| MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures | Jun 3, 2024 | Chatbot, MMLU | Unverified | 0 |
| Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function | Jun 3, 2024 | Diversity, MMLU | Code Available | 0 |
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | Jun 3, 2024 | MMLU, Multi-task Language Understanding | Code Available | 3 |
| Spanish and LLM Benchmarks: is MMLU Lost in Translation? | May 28, 2024 | MMLU, Translation | Unverified | 0 |