| GraDA: Graph Generative Data Augmentation for Commonsense Reasoning | Oct 1, 2022 | Data Augmentation, HellaSwag | Code Available | 0 |
| HellaSwag: Can a Machine Really Finish Your Sentence? | May 19, 2019 | HellaSwag, Natural Language Inference | Code Available | 0 |
| In-Contextual Gender Bias Suppression for Large Language Models | Sep 13, 2023 | counterfactual, Data Augmentation | Code Available | 0 |
| On Curriculum Learning for Commonsense Reasoning | Jul 1, 2022 | HellaSwag, Learning-To-Rank | Code Available | 0 |
| SaGE: Evaluating Moral Consistency in Large Language Models | Feb 21, 2024 | Decision Making, HellaSwag | Code Available | 0 |
| Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | May 30, 2025 | Continual Pretraining, Fairness | Code Available | 0 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Jul 4, 2024 | ARC, GSM8K | Code Available | 0 |
| Toward Adversarial Training on Contextualized Language Representation | May 8, 2023 | Decoder, global-optimization | Code Available | 0 |
| What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks | Apr 10, 2025 | Common Sense Reasoning, HellaSwag | Code Available | 0 |