| Can We Guide a Multi-Hop Reasoning Language Model to Incrementally Learn at Each Single-Hop? | Oct 1, 2022 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Can multiple-choice questions really be useful in detecting the abilities of LLMs? | Mar 26, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | May 30, 2025 | Math, Multiple-choice | Code Available | 0 | 5 |
| MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Apr 7, 2024 | Benchmarking, knowledge editing | Code Available | 0 | 5 |
| Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty? | Jul 7, 2024 | Multiple-choice | Code Available | 0 | 5 |
| A quantitative study of NLP approaches to question difficulty estimation | May 17, 2023 | Math, Multiple-choice | Code Available | 0 | 5 |
| Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions | Oct 3, 2023 | Misconceptions, Multiple-choice | Code Available | 0 | 5 |
| A Joint Sequence Fusion Model for Video Question Answering and Retrieval | Aug 7, 2018 | Decoder, Multiple-choice | Code Available | 0 | 5 |
| MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering | Nov 1, 2021 | multimodal interaction, Multiple-choice | Code Available | 0 | 5 |
| From Multiple-Choice to Extractive QA: A Case Study for English and Arabic | Apr 26, 2024 | Belebele, Extractive Question-Answering | Code Available | 0 | 5 |
| AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles | Apr 1, 2024 | Common Sense Reasoning, Multiple-choice | Code Available | 0 | 5 |
| Sentence Embeddings for Russian NLU | Oct 29, 2019 | Multiple-choice, Paraphrase Identification | Code Available | 0 | 5 |
| BUCA: A Binary Classification Approach to Unsupervised Commonsense Question Answering | May 25, 2023 | Binary Classification, Knowledge Graphs | Code Available | 0 | 5 |
| PROST: Physical Reasoning of Objects through Space and Time | Jun 7, 2021 | Multiple-choice | Code Available | 0 | 5 |
| Answer-level Calibration for Free-form Multiple Choice Question Answering | May 1, 2022 | Form, Language Modeling | Code Available | 0 | 5 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Dec 31, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback | Oct 17, 2024 | Fact Verification, Hallucination | Code Available | 0 | 5 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | May 25, 2025 | General Knowledge, MMLU | Code Available | 0 | 5 |
| Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think | Apr 12, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Measuring Agreeableness Bias in Multimodal Models | Aug 17, 2024 | Decision Making, Multiple-choice | Code Available | 0 | 5 |
| LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Oct 13, 2024 | Hallucination, Hallucination Evaluation | Code Available | 0 | 5 |
| Biomedical Entity Linking as Multiple Choice Question Answering | Feb 23, 2024 | Entity Linking, Multiple-choice | Code Available | 0 | 5 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Jun 7, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 0 | 5 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 | 5 |
| Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis | May 12, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Jul 12, 2024 | Logical Reasoning, Multiple-choice | Code Available | 0 | 5 |
| LiveQA: A Question Answering Dataset over Sports Live | Oct 1, 2020 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| Eliciting Informative Text Evaluations with Large Language Models | May 23, 2024 | Multiple-choice, Prediction | Code Available | 0 | 5 |
| Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings | Jan 15, 2024 | Knowledge Graph Embeddings, Knowledge Graphs | Code Available | 0 | 5 |
| HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Feb 9, 2025 | Answer Generation, Language Modeling | Code Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 | 5 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Mar 17, 2025 | Anatomy, Large Language Model | Code Available | 0 | 5 |
| Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models | Sep 19, 2024 | Ethics, Multiple-choice | Code Available | 0 | 5 |
| A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation using GPT | Jan 13, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 | 5 |
| Length Optimization in Conformal Prediction | Jun 27, 2024 | Conformal Prediction, Language Modeling | Code Available | 0 | 5 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | Misconceptions, Multiple-choice | Code Available | 0 | 5 |
| Learning to Reuse Distractors to support Multiple Choice Question Generation in Education | Oct 25, 2022 | Multiple-choice, Question Generation | Code Available | 0 | 5 |
| Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Open-domain Question Answering | Aug 28, 2018 | AI2 Reasoning Challenge, ARC | Code Available | 0 | 5 |
| Beyond English-Only Reading Comprehension: Experiments in Zero-Shot Multilingual Transfer for Bulgarian | Aug 5, 2019 | Multiple-choice, Philosophy | Code Available | 0 | 5 |
| EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Feb 27, 2025 | Multiple-choice | Code Available | 0 | 5 |
| Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Oct 15, 2024 | Multiple-choice | Code Available | 0 | 5 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | May 29, 2025 | MMLU, Multiple-choice | Code Available | 0 | 5 |
| BERT-based distractor generation for Swedish reading comprehension questions using a small-scale dataset | Aug 9, 2021 | Distractor Generation, Multiple-choice | Code Available | 0 | 5 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 | 5 |
| DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension | Feb 1, 2019 | Dialogue Understanding, Multiple-choice | Code Available | 0 | 5 |
| ElimiNet: A Model for Eliminating Options for Reading Comprehension with Multiple Choice Questions | Apr 4, 2019 | Multiple-choice, Reading Comprehension | Code Available | 0 | 5 |
| BertaQA: How Much Do Language Models Know About Local Culture? | Jun 11, 2024 | Multiple-choice, Transfer Learning | Code Available | 0 | 5 |
| EMBRACE: Evaluation and Modifications for Boosting RACE | May 15, 2023 | Machine Reading Comprehension, Multiple-choice | Code Available | 0 | 5 |
| Language Models as Knowledge Bases for Visual Word Sense Disambiguation | Oct 3, 2023 | Image Captioning, Multiple-choice | Code Available | 0 | 5 |
| It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning | Nov 13, 2023 | Multiple-choice | Code Available | 0 | 5 |