| Unsupervised Commonsense Question Answering with Self-Talk | Apr 11, 2020 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| R2DE: a NLP approach to estimating IRT parameters of newly generated questions | Jan 21, 2020 | Multiple-choiceQuestion Generation | CodeCode Available | 1 |
| WIQA: A dataset for "What if..." reasoning over procedural text | Sep 10, 2019 | Multiple-choice | CodeCode Available | 1 |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Nov 2, 2018 | Common Sense ReasoningMultiple-choice | CodeCode Available | 1 |
| Generating Distractors for Reading Comprehension Questions from Real Examinations | Sep 8, 2018 | DecoderDistractor Generation | CodeCode Available | 1 |
| Constructing Narrative Event Evolutionary Graph for Script Event Prediction | May 14, 2018 | Graph Neural NetworkMultiple-choice | CodeCode Available | 1 |
| VQA: Visual Question Answering | May 3, 2015 | Image CaptioningMultiple-choice | CodeCode Available | 1 |
| The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations | Jul 17, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models | Jul 17, 2025 | Multiple-choice | —Unverified | 0 |
| MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks | Jul 3, 2025 | FairnessMultiple-choice | —Unverified | 0 |