| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Jul 4, 2024 | ARC, GSM8K | Code Available | 0 | 5 |
| Few-Shot Out-of-Domain Transfer Learning of Natural Language Explanations in a Label-Abundant Setup | Dec 12, 2021 | Natural Language Inference, Transfer Learning | Code Available | 0 | 5 |
| Who's Harry Potter? Approximate Unlearning in LLMs | Oct 3, 2023 | ARC, GPU | Unverified | 0 | 0 |
| An Application of Pseudo-Log-Likelihoods to Natural Language Scoring | Jan 23, 2022 | Common Sense Reasoning, GPU | Unverified | 0 | 0 |
| WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization | Mar 31, 2025 | Common Sense Reasoning, Memorization | Unverified | 0 | 0 |
| A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language Models | Jan 14, 2022 | Constituency Parsing, Grammatical Error Detection | Unverified | 0 | 0 |
| A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models | Jun 1, 2022 | Constituency Parsing, Grammatical Error Detection | Unverified | 0 | 0 |
| Elastic Weight Consolidation for Full-Parameter Continual Pre-Training of Gemma2 | May 9, 2025 | ARC, Belebele | Unverified | 0 | 0 |
| Judgment of Thoughts: Courtroom of the Binary Logical Reasoning in Large Language Models | Sep 25, 2024 | Fake News Detection, Language Modeling | Unverified | 0 | 0 |
| More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | Apr 3, 2025 | ARC, HellaSwag | Unverified | 0 | 0 |