| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| How Many Workers to Ask? Adaptive Exploration for Collecting High Quality Labels | Nov 1, 2014 | Multiple-choice | —Unverified | 0 | 0 |
| How Susceptible are LLMs to Influence in Prompts? | Aug 17, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| How well do LLMs reason over tabular data, really? | May 12, 2025 | Missing Values, Multiple-choice | —Unverified | 0 | 0 |
| HRCA+: Advanced Multiple-choice Machine Reading Comprehension Method | Jun 1, 2022 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| Humanity's Last Exam | Jan 24, 2025 | Humanity's Last Exam, Language Modeling | —Unverified | 0 | 0 |
| Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators | Nov 8, 2024 | Decision Making, Multiple-choice | —Unverified | 0 | 0 |
| Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings | Jun 17, 2025 | Decision Making, Language Modeling | —Unverified | 0 | 0 |
| Identification of mental fatigue in language comprehension tasks based on EEG and deep learning | Apr 14, 2021 | Classification, EEG | —Unverified | 0 | 0 |
| Treatment Effects with Multidimensional Unobserved Heterogeneity: Identification of the Marginal Treatment Effect | Sep 23, 2022 | Multiple-choice | —Unverified | 0 | 0 |
| Identifying Multiple Personalities in Large Language Models with External Evaluation | Feb 22, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words | Mar 10, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| IIE-NLP-Eyas at SemEval-2021 Task 4: Enhancing PLM for ReCAM with Special Tokens, Re-Ranking, Siamese Encoders and Back Translation | Feb 25, 2021 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| IIE-NLP-NUT at SemEval-2020 Task 4: Guiding PLM with Prompt Template Reconstruction Strategy for ComVE | Jul 2, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models | Jan 1, 2025 | Hallucination, Multiple-choice | —Unverified | 0 | 0 |
| Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs | May 29, 2025 | Image Generation, Multiple-choice | —Unverified | 0 | 0 |
| Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation | May 23, 2024 | Conversational Recommendation, Multiple-choice | —Unverified | 0 | 0 |
| Improved Few-Shot Image Classification Through Multiple-Choice Questions | Jul 23, 2024 | Articles, Few-Shot Image Classification | —Unverified | 0 | 0 |
| Improvement/Extension of Modular Systems as Combinatorial Reengineering (Survey) | Apr 17, 2013 | Combinatorial Optimization, Multiple-choice | —Unverified | 0 | 0 |
| Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank | Apr 19, 2024 | Distractor Generation, Math | —Unverified | 0 | 0 |
| Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | May 21, 2025 | Multiple-choice, Multiple Choice Question Answering (MCQA) | —Unverified | 0 | 0 |
| Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets | Sep 29, 2021 | Language Modelling, Machine Reading Comprehension | —Unverified | 0 | 0 |
| Improving the Production Efficiency and Well-formedness of Automatically-Generated Multiple-Choice Cloze Vocabulary Questions | May 1, 2020 | Multiple-choice | —Unverified | 0 | 0 |
| In Case You Missed It: ARC 'Challenge' Is Not That Challenging | Dec 23, 2024 | ARC, Multiple-choice | —Unverified | 0 | 0 |
| TVBench: Redesigning Video-Language Evaluation | Oct 10, 2024 | Multiple-choice, Open-Ended Question Answering | —Unverified | 0 | 0 |
| Indirect Identification of Psychosocial Risks from Natural Language | Apr 30, 2020 | Multiple-choice, Topic Models | —Unverified | 0 | 0 |
| Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection | Jan 28, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions | Oct 19, 2022 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| InnerThoughts: Disentangling Representations and Predictions in Large Language Models | Jan 29, 2025 | Multiple-choice, Position | —Unverified | 0 | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | Jun 13, 2025 | Medical Question Answering, MedQA | —Unverified | 0 | 0 |
| Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | Feb 19, 2025 | Instruction Following, Multiple-choice | —Unverified | 0 | 0 |
| Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Dec 1, 2024 | ARC, Multiple-choice | —Unverified | 0 | 0 |
| Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering | Aug 6, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | Jun 8, 2024 | Abstractive Text Summarization, Dialogue Generation | —Unverified | 0 | 0 |
| Investigating Data Contamination in Modern Benchmarks for Large Language Models | Nov 16, 2023 | Common Sense Reasoning, MMLU | —Unverified | 0 | 0 |
| Self-Assessment Tests are Unreliable Measures of LLM Personality | Sep 15, 2023 | Multiple-choice | —Unverified | 0 | 0 |
| Investigating the Effectiveness of ChatGPT in Mathematical Reasoning and Problem Solving: Evidence from the Vietnamese National High School Graduation Examination | Jun 10, 2023 | Math, Mathematical Reasoning | —Unverified | 0 | 0 |
| Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting | Oct 18, 2023 | Multiple-choice | —Unverified | 0 | 0 |
| WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | Jun 18, 2025 | document understanding, Multiple-choice | —Unverified | 0 | 0 |
| ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | Oct 1, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | Nov 1, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | Feb 19, 2025 | Articles, Multiple-choice | —Unverified | 0 | 0 |
| An Algorithm for Generating Gap-Fill Multiple Choice Questions of an Expert System | Sep 17, 2021 | Multiple-choice, software testing | —Unverified | 0 | 0 |
| It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education | Mar 13, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Winning Amazon KDD Cup'24 | Aug 5, 2024 | Data Augmentation, Multiple-choice | —Unverified | 0 | 0 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | Feb 18, 2024 | kmmlu, Language Model Evaluation | —Unverified | 0 | 0 |
| Knowledge-Driven Distractor Generation for Cloze-style Multiple Choice Questions | Apr 21, 2020 | Distractor Generation, Learning-To-Rank | —Unverified | 0 | 0 |
| Knowledge Questions from Knowledge Graphs | Oct 31, 2016 | Knowledge Graphs, Multiple-choice | —Unverified | 0 | 0 |
| Knowledge Retrieval Based on Generative AI | Jan 8, 2025 | Large Language Model, Multiple-choice | —Unverified | 0 | 0 |