| Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | Feb 19, 2025 | Articles, Multiple-choice | —Unverified | 0 |
| An Algorithm for Generating Gap-Fill Multiple Choice Questions of an Expert System | Sep 17, 2021 | Multiple-choice, software testing | —Unverified | 0 |
| It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education | Mar 13, 2025 | Multiple-choice | —Unverified | 0 |
| Winning Amazon KDD Cup'24 | Aug 5, 2024 | Data Augmentation, Multiple-choice | —Unverified | 0 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | Feb 18, 2024 | kmmlu, Language Model Evaluation | —Unverified | 0 |
| Knowledge-Driven Distractor Generation for Cloze-style Multiple Choice Questions | Apr 21, 2020 | Distractor Generation, Learning-To-Rank | —Unverified | 0 |
| Knowledge Questions from Knowledge Graphs | Oct 31, 2016 | Knowledge Graphs, Multiple-choice | —Unverified | 0 |
| Knowledge Retrieval Based on Generative AI | Jan 8, 2025 | Large Language Model, Multiple-choice | —Unverified | 0 |
| KoBALT: Korean Benchmark For Advanced Linguistic Tasks | May 22, 2025 | Multiple-choice | —Unverified | 0 |
| KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | Mar 3, 2024 | MedQA, MMLU | —Unverified | 0 |
| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | Feb 21, 2024 | 4k, Multiple-choice | —Unverified | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | May 14, 2025 | Benchmarking, MMLU | —Unverified | 0 |
| LAB-Bench: Measuring Capabilities of Language Models for Biology Research | Jul 14, 2024 | Language Modelling, Multiple-choice | —Unverified | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | Oct 18, 2024 | Benchmarking, Fairness | —Unverified | 0 |
| Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model | Oct 1, 2024 | All, Language Modeling | —Unverified | 0 |
| Language models are susceptible to incorrect patient self-diagnosis in medical applications | Sep 17, 2023 | Diagnostic, Multiple-choice | —Unverified | 0 |
| Uncovering Cultural Representation Disparities in Vision-Language Models | May 20, 2025 | Multiple-choice | —Unverified | 0 |
| Language Models (Mostly) Know What They Know | Jul 11, 2022 | Multiple-choice | —Unverified | 0 |
| Uncovering Temporal Context for Video Question and Answering | Nov 15, 2015 | Decoder, Multiple-choice | —Unverified | 0 |
| LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights | Oct 17, 2024 | Legal Reasoning, Multiple-choice | —Unverified | 0 |
| Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations | Aug 22, 2024 | Multiple-choice | —Unverified | 0 |
| Large Language Models Could Be Rote Learners | Apr 11, 2025 | Memorization, MMLU | —Unverified | 0 |
| Understanding Dataset Design Choices for Multi-hop Reasoning | Apr 27, 2019 | Multi-hop Question Answering, Multiple-choice | —Unverified | 0 |
| Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code | Mar 9, 2023 | Multiple-choice | —Unverified | 0 |
| Large Language Models Often Know When They Are Being Evaluated | May 28, 2025 | MMLU, Multiple-choice | —Unverified | 0 |
| Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions | Aug 22, 2023 | Multiple-choice, Sensitivity | —Unverified | 0 |
| Large Language Models Still Exhibit Bias in Long Text | Oct 23, 2024 | Fairness, Multiple-choice | —Unverified | 0 |
| A Comparative Study of Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in Nephrology | Aug 9, 2023 | Multiple-choice | —Unverified | 0 |
| Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes | Oct 3, 2022 | Decision Making, Multiple-choice | —Unverified | 0 |
| Learning a Word-Level Language Model with Sentence-Level Noise Contrastive Estimation for Contextual Sentence Probability Estimation | Mar 14, 2021 | Language Modeling, Language Modelling | —Unverified | 0 |
| Learning Language-Visual Embedding for Movie Understanding with Natural-Language | Sep 26, 2016 | Multiple-choice, Retrieval | —Unverified | 0 |
| Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering | Apr 16, 2016 | General Classification, Human-Object Interaction Detection | —Unverified | 0 |
| Learning to Specialize with Knowledge Distillation for Visual Question Answering | Dec 1, 2018 | General Classification, General Knowledge | —Unverified | 0 |
| An AI-based Solution for Enhancing Delivery of Digital Learning for Future Teachers | Nov 9, 2021 | Multiple-choice, Question Generation | —Unverified | 0 |
| LegalBench.PT: A Benchmark for Portuguese Law | Feb 22, 2025 | Multiple-choice | —Unverified | 0 |
| Teaching Pretrained Models with Commonsense Reasoning: A Preliminary KB-Based Approach | Sep 20, 2019 | Few-Shot Learning, Logical Reasoning | —Unverified | 0 |
| WIQA: A dataset for "What if..." reasoning over procedural text | Nov 1, 2019 | Multiple-choice | —Unverified | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | May 19, 2025 | Benchmarking, Legal Reasoning | —Unverified | 0 |
| LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models | Mar 19, 2024 | Multiple-choice | —Unverified | 0 |
| WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications | May 20, 2025 | Mathematical Reasoning, Multiple-choice | —Unverified | 0 |
| Linguistic Legal Concept Extraction in Portuguese | Oct 22, 2018 | Ethics, Multiple-choice | —Unverified | 0 |
| Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA | Oct 3, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ | Sep 25, 2024 | Chatbot, GSM8K | —Unverified | 0 |
| LLM-as-a-Judge & Reward Model: What They Can and Cannot Do | Sep 17, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load | May 4, 2025 | Articles, Multiple-choice | —Unverified | 0 |
| LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering | Dec 13, 2024 | Few-Shot Learning, Knowledge Distillation | —Unverified | 0 |
| Unlearning vs. Obfuscation: Are We Truly Removing Knowledge? | May 5, 2025 | Multiple-choice | —Unverified | 0 |
| LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering | Jan 25, 2025 | Information Retrieval, Multiple-choice | —Unverified | 0 |
| Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario | Dec 4, 2023 | Language Modeling, Language Modelling | —Unverified | 0 |
| LLMs to Support a Domain Specific Knowledge Assistant | Feb 6, 2025 | Chatbot, Multiple-choice | —Unverified | 0 |