| ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Apr 15, 2021 | Graph Generation, Multiple-choice | Code Available | 1 | 5 |
| MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Feb 26, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| NarrativeXL: A Large-scale Dataset For Long-Term Memory Models | May 23, 2023 | Multiple-choice, Reading Comprehension | Code Available | 1 | 5 |
| MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Oct 14, 2024 | Multiple-choice | Code Available | 1 | 5 |
| Option Tracing: Beyond Binary Knowledge Tracing | Dec 11, 2020 | Knowledge Tracing, Multiple-choice | Code Available | 1 | 5 |
| The Effect of Sampling Temperature on Problem Solving in Large Language Models | Feb 7, 2024 | Multiple-choice, Prompt Engineering | Code Available | 1 | 5 |
| Unsupervised Commonsense Question Answering with Self-Talk | Apr 11, 2020 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| A Study on Large Language Models' Limitations in Multiple-Choice Question Answering | Jan 15, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Apr 7, 2024 | Benchmarking, knowledge editing | Code Available | 0 | 5 |
| Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction | Mar 5, 2025 | In-Context Learning, Multiple-choice | Code Available | 0 | 5 |
| Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | May 30, 2025 | Math, Multiple-choice | Code Available | 0 | 5 |
| Confident Multiple Choice Learning | Jun 12, 2017 | General Classification, image-classification | Code Available | 0 | 5 |
| Assessing the Quality of Multiple-Choice Questions Using GPT-4 and Rule-Based Methods | Jul 16, 2023 | Multiple-choice | Code Available | 0 | 5 |
| MMM: Multi-stage Multi-task Learning for Multi-choice Reading Comprehension | Oct 1, 2019 | Logical Reasoning, Machine Reading Comprehension | Code Available | 0 | 5 |
| COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes | Sep 6, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering | Nov 1, 2021 | multimodal interaction, Multiple-choice | Code Available | 0 | 5 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 | 5 |
| A Simple Method for Commonsense Reasoning | Jun 7, 2018 | Common Sense Reasoning, Coreference Resolution | Code Available | 0 | 5 |
| MedG-KRP: Medical Graph Knowledge Representation Probing | Dec 14, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 | 5 |
| MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback | Oct 17, 2024 | Fact Verification, Hallucination | Code Available | 0 | 5 |
| A Benchmark for Long-Form Medical Question Answering | Nov 14, 2024 | Answer Generation, Form | Code Available | 0 | 5 |
| Measuring Agreeableness Bias in Multimodal Models | Aug 17, 2024 | Decision Making, Multiple-choice | Code Available | 0 | 5 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Dec 31, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| CNN for Text-Based Multiple Choice Question Answering | Jul 1, 2018 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| A Multiple Choices Reading Comprehension Corpus for Vietnamese Language Education | Mar 31, 2023 | Articles, Machine Reading Comprehension | Code Available | 0 | 5 |
| Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? | Feb 19, 2024 | Decision Making, Memorization | Code Available | 0 | 5 |
| Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think | Apr 12, 2024 | Multiple-choice | Code Available | 0 | 5 |
| CLOMO: Counterfactual Logical Modification with Large Language Models | Nov 29, 2023 | counterfactual, Counterfactual Reasoning | Code Available | 0 | 5 |
| ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning | Feb 7, 2025 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Oct 13, 2024 | Hallucination, Hallucination Evaluation | Code Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 | 5 |
| Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis | May 12, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| LiveQA: A Question Answering Dataset over Sports Live | Oct 1, 2020 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Jun 7, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 0 | 5 |
| ChatGPT for GTFS: Benchmarking LLMs on GTFS Understanding and Retrieval | Aug 4, 2023 | Benchmarking, Information Retrieval | Code Available | 0 | 5 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 | 5 |
| Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Jul 12, 2024 | Logical Reasoning, Multiple-choice | Code Available | 0 | 5 |
| Chain-of-Exemplar: Enhancing Distractor Generation for Multimodal Educational Question Generation | Aug 16, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 | 5 |
| Are Large Language Models Consistent over Value-laden Questions? | Jul 3, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings | Jan 15, 2024 | Knowledge Graph Embeddings, Knowledge Graphs | Code Available | 0 | 5 |
| HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Feb 9, 2025 | Answer Generation, Language Modeling | Code Available | 0 | 5 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Mar 17, 2025 | Anatomy, Large Language Model | Code Available | 0 | 5 |
| Length Optimization in Conformal Prediction | Jun 27, 2024 | Conformal Prediction, Language Modeling | Code Available | 0 | 5 |
| CASE: Commonsense-Augmented Score with an Expanded Answer Space | Nov 3, 2023 | Multiple-choice | Code Available | 0 | 5 |
| Cascading Biases: Investigating the Effect of Heuristic Annotation Strategies on Data and Models | Oct 24, 2022 | Multiple-choice, Reading Comprehension | Code Available | 0 | 5 |
| Abductive Commonsense Reasoning | Aug 15, 2019 | Multiple-choice, Natural Language Inference | Code Available | 0 | 5 |
| A large language model-assisted education tool to provide feedback on open-ended responses | Jul 25, 2023 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Can We Guide a Multi-Hop Reasoning Language Model to Incrementally Learn at Each Single-Hop? | Oct 1, 2022 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | Misconceptions, Multiple-choice | Code Available | 0 | 5 |