| AGenT Zero: Zero-shot Automatic Multiple-Choice Question Generation for Skill Assessments | Nov 25, 2020 | Multiple-choice, Question Generation | —Unverified | 0 | 0 |
| VideoMCC: a New Benchmark for Video Comprehension | Jun 23, 2016 | Multiple-choice, Video Description | —Unverified | 0 | 0 |
| Optimal Weighting for Exam Composition | Dec 24, 2017 | Multiple-choice | —Unverified | 0 | 0 |
| Option Comparison Network for Multiple-choice Reading Comprehension | Mar 7, 2019 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Options-Aware Dense Retrieval for Multiple-Choice query Answering | Jan 27, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Video Question Answering via Attribute-Augmented Attention Network Learning | Jul 20, 2017 | Attribute, Information Retrieval | —Unverified | 0 | 0 |
| ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models | Apr 17, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Order Independence With Finetuning | Mar 30, 2025 | ARC, Language Modeling | —Unverified | 0 | 0 |
| PADDLe: a Platform to Identify Complex Words for Learners of French as a Foreign Language (FFL) | Jun 1, 2022 | Multiple-choice | —Unverified | 0 | 0 |
| Paragraph Similarity Matches for Generating Multiple-choice Test Items | Sep 1, 2021 | Management, Multiple-choice | —Unverified | 0 | 0 |
| VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | Feb 14, 2025 | Image Captioning, Large Language Model | —Unverified | 0 | 0 |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | Nov 23, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| The AI Penalization Effect: People Reduce Compensation for Workers Who Use AI | Jan 22, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | Dec 20, 2023 | Benchmarking, Grounded Video Question Answering | —Unverified | 0 | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | —Unverified | 0 | 0 |
| A Foundational Multimodal Vision Language AI Assistant for Human Pathology | Dec 13, 2023 | Decision Making, Diagnostic | —Unverified | 0 | 0 |
| PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian | Feb 11, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions | Sep 12, 2023 | Multiple-choice, Sentence | —Unverified | 0 | 0 |
| Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis | Jun 3, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain | May 30, 2025 | Instruction Following, Multiple-choice | —Unverified | 0 | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | Oct 14, 2024 | Benchmarking, Management | —Unverified | 0 | 0 |
| PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models | Jun 21, 2025 | Mathematical Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| Vision-Language Models Do Not Understand Negation | Jan 16, 2025 | Multiple-choice, Negation | —Unverified | 0 | 0 |
| Predicting Item Survival for Multiple Choice Questions in a High-Stakes Medical Exam | May 1, 2020 | Information Retrieval, Multiple-choice | —Unverified | 0 | 0 |
| Predicting the Difficulty and Response Time of Multiple Choice Questions Using Transfer Learning | Jul 1, 2020 | Multiple-choice, Transfer Learning | —Unverified | 0 | 0 |