| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Dec 10, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| MedG-KRP: Medical Graph Knowledge Representation Probing | Dec 14, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 | 5 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 | 5 |
| EMBRACE: Evaluation and Modifications for Boosting RACE | May 15, 2023 | Machine Reading Comprehension, Multiple-choice | Code Available | 0 | 5 |
| MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback | Oct 17, 2024 | Fact Verification, Hallucination | Code Available | 0 | 5 |
| Measuring Agreeableness Bias in Multimodal Models | Aug 17, 2024 | Decision Making, Multiple-choice | Code Available | 0 | 5 |
| ElimiNet: A Model for Eliminating Options for Reading Comprehension with Multiple Choice Questions | Apr 4, 2019 | Multiple-choice, Reading Comprehension | Code Available | 0 | 5 |
| Eliciting Informative Text Evaluations with Large Language Models | May 23, 2024 | Multiple-choice, Prediction | Code Available | 0 | 5 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Dec 31, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models | Sep 19, 2024 | Ethics, Multiple-choice | Code Available | 0 | 5 |
| A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation using GPT | Jan 13, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 | 5 |
| LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Oct 13, 2024 | Hallucination, Hallucination Evaluation | Code Available | 0 | 5 |
| Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think | Apr 12, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Beyond English-Only Reading Comprehension: Experiments in Zero-Shot Multilingual Transfer for Bulgarian | Aug 5, 2019 | Multiple-choice, Philosophy | Code Available | 0 | 5 |
| EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Feb 27, 2025 | Multiple-choice | Code Available | 0 | 5 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | May 29, 2025 | MMLU, Multiple-choice | Code Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 | 5 |
| BERT-based distractor generation for Swedish reading comprehension questions using a small-scale dataset | Aug 9, 2021 | Distractor Generation, Multiple-choice | Code Available | 0 | 5 |
| Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis | May 12, 2024 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension | Feb 1, 2019 | Dialogue Understanding, Multiple-choice | Code Available | 0 | 5 |
| BertaQA: How Much Do Language Models Know About Local Culture? | Jun 11, 2024 | Multiple-choice, Transfer Learning | Code Available | 0 | 5 |
| LiveQA: A Question Answering Dataset over Sports Live | Oct 1, 2020 | Multiple-choice, Question Answering | Code Available | 0 | 5 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Jun 7, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 0 | 5 |
| Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings | Jan 15, 2024 | Knowledge Graph Embeddings, Knowledge Graphs | Code Available | 0 | 5 |
| HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Feb 9, 2025 | Answer Generation, Language Modeling | Code Available | 0 | 5 |
| Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Jul 12, 2024 | Logical Reasoning, Multiple-choice | Code Available | 0 | 5 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Mar 17, 2025 | Anatomy, Large Language Model | Code Available | 0 | 5 |
| Evaluating Prompts Across Multiple Choice Tasks In a Zero-Shot Setting | Mar 29, 2022 | Multiple-choice | Code Available | 0 | 5 |
| Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Oct 15, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Length Optimization in Conformal Prediction | Jun 27, 2024 | Conformal Prediction, Language Modeling | Code Available | 0 | 5 |
| Neural Natural Logic Inference for Interpretable Question Answering | Nov 1, 2021 | Multiple-choice, Natural Language Inference | Code Available | 0 | 5 |
| Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT | Dec 13, 2024 | Multiple-choice | Code Available | 0 | 5 |
| DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition | Dec 23, 2019 | Action Recognition, Multiple-choice | Code Available | 0 | 5 |
| DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models | Oct 2, 2024 | Multiple-choice, parameter-efficient fine-tuning | Code Available | 0 | 5 |
| DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions | Jun 27, 2024 | Distractor Generation, Math | Code Available | 0 | 5 |
| An Information-Theoretic Approach to Analyze NLP Classification Tasks | Feb 1, 2024 | Multiple-choice, Reading Comprehension | Code Available | 0 | 5 |
| Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Open-domain Question Answering | Aug 28, 2018 | AI2 Reasoning Challenge, ARC | Code Available | 0 | 5 |
| Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Jun 6, 2024 | Common Sense Reasoning, Language Modeling | Code Available | 0 | 5 |
| KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models | Oct 15, 2023 | Multiple-choice, Triplet | Code Available | 0 | 5 |
| Sentence Embeddings for Russian NLU | Oct 29, 2019 | Multiple-choice, Paraphrase Identification | Code Available | 0 | 5 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Apr 9, 2025 | Multiple-choice | Code Available | 0 | 5 |
| Distractor generation for multiple-choice questions with predictive prompting and large language models | Jul 30, 2023 | Distractor Generation, Multiple-choice | Code Available | 0 | 5 |
| Distractor Generation for Multiple Choice Questions Using Learning to Rank | Jun 1, 2018 | BIG-bench Machine Learning, Distractor Generation | Code Available | 0 | 5 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 | 5 |
| It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning | Nov 13, 2023 | Multiple-choice | Code Available | 0 | 5 |
| DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking | Sep 26, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 | 5 |
| Iterative Forward Tuning Boosts In-Context Learning in Language Models | May 22, 2023 | Decision Making, In-Context Learning | Code Available | 0 | 5 |
| Joint Learning of Sentence Embeddings for Relevance and Entailment | May 16, 2016 | Decision Making, Information Retrieval | Code Available | 0 | 5 |
| Language Models as Knowledge Bases for Visual Word Sense Disambiguation | Oct 3, 2023 | Image Captioning, Multiple-choice | Code Available | 0 | 5 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | Misconceptions, Multiple-choice | Code Available | 0 | 5 |