| Narrative Embedding: Re-Contextualization Through Attention | Nov 1, 2021 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Jun 10, 2025 | Multiple-choice, Open-Ended Question Answering | —Unverified | 0 | 0 |
| NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects? | Nov 26, 2024 | Attribute, Multiple-choice | —Unverified | 0 | 0 |
| AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | Apr 14, 2025 | Management, Multiple-choice | —Unverified | 0 | 0 |
| Network-based Representations and Dynamic Discrete Choice Models for Multiple Discrete Choice Analysis | Jun 7, 2023 | Discrete Choice Models, Multiple-choice | —Unverified | 0 | 0 |
| WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning | May 6, 2024 | Multiple-choice, Video Understanding | —Unverified | 0 | 0 |
| VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | Nov 20, 2024 | Chatbot, Multiple-choice | —Unverified | 0 | 0 |
| NEWSKVQA: Knowledge-Aware News Video Question Answering | Feb 8, 2022 | Common Sense Reasoning, Management | —Unverified | 0 | 0 |
| Video Instruction Tuning With Synthetic Data | Oct 3, 2024 | 3D Question Answering (3D-QA) | —Unverified | 0 | 0 |
| None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | Mar 3, 2025 | Business Ethics, Ethics | —Unverified | 0 | 0 |
| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | Feb 18, 2025 | Math, Memorization | —Unverified | 0 | 0 |
| No Task Left Behind: Multi-Task Learning of Knowledge Tracing and Option Tracing for Better Student Assessment | Apr 8, 2022 | Knowledge Tracing, Multiple-choice | —Unverified | 0 | 0 |
| Note on Combinatorial Engineering Frameworks for Hierarchical Modular Systems | Mar 29, 2013 | Combinatorial Optimization, Multiple-choice | —Unverified | 0 | 0 |
| Note on Evolution and Forecasting of Requirements: Communications Example | May 22, 2017 | Multiple-choice | —Unverified | 0 | 0 |
| Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning | Aug 30, 2024 | Causal Language Modeling, Continual Learning | —Unverified | 0 | 0 |
| NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models | Jul 15, 2024 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| Objective quantification of mood states using large language models | Feb 13, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities | Feb 18, 2025 | Large Language Model, Multiple-choice | —Unverified | 0 | 0 |
| OLMES: A Standard for Language Model Evaluations | Jun 12, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs | Jun 26, 2025 | Diversity, Multiple-choice | —Unverified | 0 | 0 |
| Online Joint Bid/Daily Budget Optimization of Internet Advertising Campaigns | Mar 3, 2020 | Gaussian Processes, Multiple-choice | —Unverified | 0 | 0 |
| On the application of Transformers for estimating the difficulty of Multiple-Choice Questions from text | Apr 1, 2021 | Multiple-choice | —Unverified | 0 | 0 |
| On the Performance of Multimodal Language Models | Oct 4, 2023 | Benchmarking, Binary Classification | —Unverified | 0 | 0 |
| On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models | Jun 18, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| On the Reasoning Capacity of AI Models and How to Quantify It | Jan 23, 2025 | Memorization, MMLU | —Unverified | 0 | 0 |
| AGenT Zero: Zero-shot Automatic Multiple-Choice Question Generation for Skill Assessments | Nov 25, 2020 | Multiple-choice, Question Generation | —Unverified | 0 | 0 |
| VideoMCC: a New Benchmark for Video Comprehension | Jun 23, 2016 | Multiple-choice, Video Description | —Unverified | 0 | 0 |
| Optimal Weighting for Exam Composition | Dec 24, 2017 | Multiple-choice | —Unverified | 0 | 0 |
| Option Comparison Network for Multiple-choice Reading Comprehension | Mar 7, 2019 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Options-Aware Dense Retrieval for Multiple-Choice query Answering | Jan 27, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Video Question Answering via Attribute-Augmented Attention Network Learning | Jul 20, 2017 | Attribute, Information Retrieval | —Unverified | 0 | 0 |
| ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models | Apr 17, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Order Independence With Finetuning | Mar 30, 2025 | ARC, Language Modeling | —Unverified | 0 | 0 |
| PADDLe: a Platform to Identify Complex Words for Learners of French as a Foreign Language (FFL) | Jun 1, 2022 | Multiple-choice | —Unverified | 0 | 0 |
| Paragraph Similarity Matches for Generating Multiple-choice Test Items | Sep 1, 2021 | Management, Multiple-choice | —Unverified | 0 | 0 |
| VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | Feb 14, 2025 | Image Captioning, Large Language Model | —Unverified | 0 | 0 |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | Nov 23, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| The AI Penalization Effect: People Reduce Compensation for Workers Who Use AI | Jan 22, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | Dec 20, 2023 | Benchmarking, Grounded Video Question Answering | —Unverified | 0 | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | —Unverified | 0 | 0 |
| A Foundational Multimodal Vision Language AI Assistant for Human Pathology | Dec 13, 2023 | Decision Making, Diagnostic | —Unverified | 0 | 0 |
| PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian | Feb 11, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions | Sep 12, 2023 | Multiple-choice, Sentence | —Unverified | 0 | 0 |
| Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis | Jun 3, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain | May 30, 2025 | Instruction Following, Multiple-choice | —Unverified | 0 | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | Oct 14, 2024 | Benchmarking, Management | —Unverified | 0 | 0 |
| PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models | Jun 21, 2025 | Mathematical Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| Vision-Language Models Do Not Understand Negation | Jan 16, 2025 | Multiple-choice, Negation | —Unverified | 0 | 0 |
| Predicting Item Survival for Multiple Choice Questions in a High-Stakes Medical Exam | May 1, 2020 | Information Retrieval, Multiple-choice | —Unverified | 0 | 0 |
| Predicting the Difficulty and Response Time of Multiple Choice Questions Using Transfer Learning | Jul 1, 2020 | Multiple-choice, Transfer Learning | —Unverified | 0 | 0 |