| Answer-level Calibration for Free-form Multiple Choice Question Answering | May 1, 2022 | Form, Language Modeling | Code Available | 0 |
| Sentence Embeddings for Russian NLU | Oct 29, 2019 | Multiple-choice, Paraphrase Identification | Code Available | 0 |
| Language Models as Knowledge Bases for Visual Word Sense Disambiguation | Oct 3, 2023 | Image Captioning, Multiple-choice | Code Available | 0 |
| Multimodal Residual Learning for Visual QA | Jun 5, 2016 | Multiple-choice, Question Answering | Code Available | 0 |
| QASC: A Dataset for Question Answering via Sentence Composition | Oct 25, 2019 | Common Sense Reasoning, Multi-hop Question Answering | Code Available | 0 |
| VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation | Aug 15, 2017 | Language Modeling, Language Modelling | Code Available | 0 |
| Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | May 30, 2025 | Continual Pretraining, Fairness | Code Available | 0 |
| Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Jun 6, 2024 | Common Sense Reasoning, Language Modeling | Code Available | 0 |
| Evidence Sentence Extraction for Machine Reading Comprehension | Feb 23, 2019 | Machine Reading Comprehension, Multiple-choice | Code Available | 0 |
| BertaQA: How Much Do Language Models Know About Local Culture? | Jun 11, 2024 | Multiple-choice, Transfer Learning | Code Available | 0 |
| EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models | Mar 15, 2024 | Miscellaneous, Multiple-choice | Code Available | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | Benchmarking, Information Retrieval | Code Available | 0 |
| BERT-based distractor generation for Swedish reading comprehension questions using a small-scale dataset | Aug 9, 2021 | Distractor Generation, Multiple-choice | Code Available | 0 |
| Quantitative Assessment of Intersectional Empathetic Bias and Understanding | Nov 8, 2024 | Multiple-choice | Code Available | 0 |
| Explanatory Argument Extraction of Correct Answers in Resident Medical Exams | Dec 1, 2023 | Multiple-choice | Code Available | 0 |
| Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data | Jun 4, 2024 | Clinical Knowledge, Multiple-choice | Code Available | 0 |
| Cascading Biases: Investigating the Effect of Heuristic Annotation Strategies on Data and Models | Oct 24, 2022 | Multiple-choice, Reading Comprehension | Code Available | 0 |
| Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning | Aug 7, 2023 | In-Context Learning, Math | Code Available | 0 |
| Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models | Apr 2, 2024 | Distractor Generation, In-Context Learning | Code Available | 0 |
| Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models | Sep 19, 2023 | Explanation Generation, Language Modelling | Code Available | 0 |
| Question Answering as Global Reasoning over Semantic Abstractions | Jun 9, 2019 | Information Retrieval, Multiple-choice | Code Available | 0 |
| KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting | Dec 1, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions | Dec 18, 2023 | Multiple-choice, Pedestrian Trajectory Prediction | Code Available | 0 |
| Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models | Mar 30, 2025 | Knowledge Graphs, Multiple-choice | Code Available | 0 |
| An Automatic Question Usability Evaluation Toolkit | May 30, 2024 | Multiple-choice, Word Embeddings | Code Available | 0 |
| SocialIQA: Commonsense Reasoning about Social Interactions | Apr 22, 2019 | Common Sense Reasoning, Coreference Resolution | Code Available | 0 |
| Questioning the Survey Responses of Large Language Models | Jun 13, 2023 | Multiple-choice, Survey | Code Available | 0 |
| Exposing the Limits of Video-Text Models through Contrast Sets | Jul 1, 2022 | Language Modeling, Language Modelling | Code Available | 0 |
| Extracting Keywords from Open-Ended Business Survey Questions | Aug 31, 2018 | Multiple-choice, Survey | Code Available | 0 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Feb 16, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Open-domain Question Answering | Aug 28, 2018 | AI2 Reasoning Challenge, ARC | Code Available | 0 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | Misconceptions, Multiple-choice | Code Available | 0 |
| Learning to Reuse Distractors to support Multiple Choice Question Generation in Education | Oct 25, 2022 | Multiple-choice, Question Generation | Code Available | 0 |
| "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Feb 22, 2024 | Multiple-choice, Text Generation | Code Available | 0 |
| Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Feb 20, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| FAT ALBERT: Finding Answers in Large Texts using Semantic Similarity Attention Layer based on BERT | Aug 22, 2020 | Multiple-choice, Question Answering | Code Available | 0 |
| Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Oct 15, 2024 | Multiple-choice | Code Available | 0 |
| TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Nov 13, 2024 | Decision Making, Missing Values | Code Available | 0 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Mar 17, 2025 | Anatomy, Large Language Model | Code Available | 0 |
| Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Feb 22, 2025 | Distractor Generation, Information Retrieval | Code Available | 0 |
| Length Optimization in Conformal Prediction | Jun 27, 2024 | Conformal Prediction, Language Modeling | Code Available | 0 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Apr 9, 2021 | Language Modelling, Multiple-choice | Code Available | 0 |
| Training-free LLM Merging for Multi-task Learning | Jun 14, 2025 | Multiple-choice, Multi-Task Learning | Code Available | 0 |
| Solving and Generating NPR Sunday Puzzles with Large Language Models | Jun 21, 2023 | Multiple-choice, Prompt Engineering | Code Available | 0 |
| HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Feb 9, 2025 | Answer Generation, Language Modeling | Code Available | 0 |
| Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings | Jan 15, 2024 | Knowledge Graph Embeddings, Knowledge Graphs | Code Available | 0 |
| UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions | Apr 20, 2024 | Data Augmentation, Multiple-choice | Code Available | 0 |
| Solving Machine Learning Problems | Jul 2, 2021 | BIG-bench Machine Learning, Data Augmentation | Code Available | 0 |
| Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Jul 12, 2024 | Logical Reasoning, Multiple-choice | Code Available | 0 |
| Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Jan 10, 2025 | Automatic Speech Recognition, Classification | Code Available | 0 |