| Beyond English-Only Reading Comprehension: Experiments in Zero-Shot Multilingual Transfer for Bulgarian | Aug 5, 2019 | Multiple-choice, Philosophy | Code Available | 0 |
| A quantitative study of NLP approaches to question difficulty estimation | May 17, 2023 | Math, Multiple-choice | Code Available | 0 |
| Unified Question Answering in Slovene | Nov 16, 2022 | Cross-Lingual Transfer, Decoder | Code Available | 0 |
| Neural Natural Logic Inference for Interpretable Question Answering | Nov 1, 2021 | Multiple-choice, Natural Language Inference | Code Available | 0 |
| Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis | May 12, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain | Apr 9, 2023 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| Real-Time Automated Answer Scoring | Oct 13, 2022 | Multiple-choice | Code Available | 0 |
| Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions | May 30, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| LiveQA: A Question Answering Dataset over Sports Live | Oct 1, 2020 | Multiple-choice, Question Answering | Code Available | 0 |
| CASE: Commonsense-Augmented Score with an Expanded Answer Space | Nov 3, 2023 | Multiple-choice | Code Available | 0 |
| Which Shortcut Solution Do Question Answering Models Prefer to Learn? | Nov 29, 2022 | Multiple-choice, Question Answering | Code Available | 0 |
| From Recognition to Cognition: Visual Commonsense Reasoning | Nov 27, 2018 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding | Jan 1, 2025 | Action Recognition, Multiple-choice | Code Available | 0 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 |
| Fusing Models with Complementary Expertise | Oct 2, 2023 | Multiple-choice, text-classification | Code Available | 0 |
| A Benchmark for Long-Form Medical Question Answering | Nov 14, 2024 | Answer Generation, Form | Code Available | 0 |
| Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | Mar 20, 2025 | Multiple-choice, Text Generation | Code Available | 0 |
| ReCoMIF: Reading comprehension based multi-source information fusion network for Chinese spoken language understanding | Aug 1, 2023 | Intent Detection, Multiple-choice | Code Available | 0 |
| NLP at UC Santa Cruz at SemEval-2024 Task 5: Legal Answer Validation using Few-Shot Multi-Choice QA | Apr 4, 2024 | Multiple-choice | Code Available | 0 |
| Gendered Pronoun Resolution using BERT and an extractive question answering formulation | Jun 9, 2019 | coreference-resolution, Coreference Resolution | Code Available | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Dec 2, 2024 | MMLU, Multiple-choice | Code Available | 0 |
| Spoken Language Intelligence of Large Language Models for Language Learning | Aug 28, 2023 | Language Acquisition, Multiple-choice | Code Available | 0 |
| ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant | May 6, 2025 | Descriptive, Multiple-choice | Code Available | 0 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Jun 7, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 0 |
| Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions | Jun 16, 2024 | Decision Making, Language Modelling | Code Available | 0 |
| What Makes Reading Comprehension Questions Difficult? | Mar 12, 2022 | Logical Reasoning, Multiple-choice | Code Available | 0 |
| Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options | Aug 27, 2024 | Decision Making, Multiple-choice | Code Available | 0 |
| COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes | Sep 6, 2024 | Multiple-choice, Question Answering | Code Available | 0 |
| An Information-Theoretic Approach to Analyze NLP Classification Tasks | Feb 1, 2024 | Multiple-choice, Reading Comprehension | Code Available | 0 |
| World Knowledge in Multiple Choice Reading Comprehension | Nov 13, 2022 | General Knowledge, Multiple-choice | Code Available | 0 |
| NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models | Oct 11, 2024 | Multiple-choice, TruthfulQA | Code Available | 0 |
| Are Large Language Models Consistent over Value-laden Questions? | Jul 3, 2024 | Multiple-choice | Code Available | 0 |
| Revisiting Visual Question Answering Baselines | Jun 27, 2016 | Binary Classification, Multiple-choice | Code Available | 0 |
| LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Oct 13, 2024 | Hallucination, Hallucination Evaluation | Code Available | 0 |
| BUCA: A Binary Classification Approach to Unsupervised Commonsense Question Answering | May 25, 2023 | Binary Classification, Knowledge Graphs | Code Available | 0 |
| Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models | Apr 11, 2024 | Multiple-choice, Reading Comprehension | Code Available | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 |
| Abductive Commonsense Reasoning | Aug 15, 2019 | Multiple-choice, Natural Language Inference | Code Available | 0 |
| A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education | Mar 31, 2023 | Articles, Machine Reading Comprehension | Code Available | 0 |
| When an LLM is apprehensive about its answers -- and when its uncertainty is justified | Mar 3, 2025 | Math, MMLU | Code Available | 0 |
| Grade Score: Quantifying LLM Performance in Option Selection | Jun 17, 2024 | Decision Making, Fairness | Code Available | 0 |
| Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think | Apr 12, 2024 | Multiple-choice | Code Available | 0 |
| This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs | Mar 7, 2025 | Large Language Model, Multiple-choice | Code Available | 0 |
| StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding | Oct 19, 2023 | Multiple-choice, Natural Language Understanding | Code Available | 0 |
| Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | May 13, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| From Multiple-Choice to Extractive QA: A Case Study for English and Arabic | Apr 26, 2024 | Belebele, Extractive Question-Answering | Code Available | 0 |
| ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning | Feb 7, 2025 | Multiple-choice, Question Answering | Code Available | 0 |
| Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors | Jun 3, 2024 | Multiple-choice, Selection bias | Code Available | 0 |
| QMOS: Enhancing LLMs for Telecommunication with Question Masked loss and Option Shuffling | Sep 21, 2024 | Multiple-choice, Prompt Engineering | Code Available | 0 |
| Truth Knows No Language: Evaluating Truthfulness Beyond English | Feb 13, 2025 | Informativeness, Machine Translation | Code Available | 0 |