| Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | Feb 19, 2025 | Articles, Multiple-choice | —Unverified | 0 |
| An Algorithm for Generating Gap-Fill Multiple Choice Questions of an Expert System | Sep 17, 2021 | Multiple-choice, software testing | —Unverified | 0 |
| It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education | Mar 13, 2025 | Multiple-choice | —Unverified | 0 |
| Winning Amazon KDD Cup'24 | Aug 5, 2024 | Data Augmentation, Multiple-choice | —Unverified | 0 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | Feb 18, 2024 | kmmlu, Language Model Evaluation | —Unverified | 0 |
| Knowledge-Driven Distractor Generation for Cloze-style Multiple Choice Questions | Apr 21, 2020 | Distractor Generation, Learning-To-Rank | —Unverified | 0 |
| Knowledge Questions from Knowledge Graphs | Oct 31, 2016 | Knowledge Graphs, Multiple-choice | —Unverified | 0 |
| Knowledge Retrieval Based on Generative AI | Jan 8, 2025 | Large Language Model, Multiple-choice | —Unverified | 0 |
| KoBALT: Korean Benchmark For Advanced Linguistic Tasks | May 22, 2025 | Multiple-choice | —Unverified | 0 |
| KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | Mar 3, 2024 | MedQA, MMLU | —Unverified | 0 |
| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | Feb 21, 2024 | 4k, Multiple-choice | —Unverified | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | May 14, 2025 | Benchmarking, MMLU | —Unverified | 0 |
| LAB-Bench: Measuring Capabilities of Language Models for Biology Research | Jul 14, 2024 | Language Modelling, Multiple-choice | —Unverified | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | Oct 18, 2024 | Benchmarking, Fairness | —Unverified | 0 |
| Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model | Oct 1, 2024 | All, Language Modeling | —Unverified | 0 |
| Language models are susceptible to incorrect patient self-diagnosis in medical applications | Sep 17, 2023 | Diagnostic, Multiple-choice | —Unverified | 0 |
| Uncovering Cultural Representation Disparities in Vision-Language Models | May 20, 2025 | Multiple-choice | —Unverified | 0 |
| Language Models (Mostly) Know What They Know | Jul 11, 2022 | Multiple-choice | —Unverified | 0 |
| Uncovering Temporal Context for Video Question and Answering | Nov 15, 2015 | Decoder, Multiple-choice | —Unverified | 0 |
| LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights | Oct 17, 2024 | Legal Reasoning, Multiple-choice | —Unverified | 0 |
| Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations | Aug 22, 2024 | Multiple-choice | —Unverified | 0 |
| Large Language Models Could Be Rote Learners | Apr 11, 2025 | Memorization, MMLU | —Unverified | 0 |
| Understanding Dataset Design Choices for Multi-hop Reasoning | Apr 27, 2019 | Multi-hop Question Answering, Multiple-choice | —Unverified | 0 |
| Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code | Mar 9, 2023 | Multiple-choice | —Unverified | 0 |
| Large Language Models Often Know When They Are Being Evaluated | May 28, 2025 | MMLU, Multiple-choice | —Unverified | 0 |