| On the application of Transformers for estimating the difficulty of Multiple-Choice Questions from text | Apr 1, 2021 | Multiple-choice | Unverified | 0 |
| On the Performance of Multimodal Language Models | Oct 4, 2023 | Benchmarking, Binary Classification | Unverified | 0 |
| On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models | Jun 18, 2024 | Multiple-choice | Unverified | 0 |
| On the Reasoning Capacity of AI Models and How to Quantify It | Jan 23, 2025 | Memorization, MMLU | Unverified | 0 |
| AGenT Zero: Zero-shot Automatic Multiple-Choice Question Generation for Skill Assessments | Nov 25, 2020 | Multiple-choice, Question Generation | Unverified | 0 |
| VideoMCC: a New Benchmark for Video Comprehension | Jun 23, 2016 | Multiple-choice, Video Description | Unverified | 0 |
| Optimal Weighting for Exam Composition | Dec 24, 2017 | Multiple-choice | Unverified | 0 |
| Option Comparison Network for Multiple-choice Reading Comprehension | Mar 7, 2019 | Multiple-choice, Question Answering | Unverified | 0 |
| Options-Aware Dense Retrieval for Multiple-Choice query Answering | Jan 27, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Video Question Answering via Attribute-Augmented Attention Network Learning | Jul 20, 2017 | Attribute, Information Retrieval | Unverified | 0 |
| ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models | Apr 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Order Independence With Finetuning | Mar 30, 2025 | ARC, Language Modeling | Unverified | 0 |
| PADDLe: a Platform to Identify Complex Words for Learners of French as a Foreign Language (FFL) | Jun 1, 2022 | Multiple-choice | Unverified | 0 |
| Paragraph Similarity Matches for Generating Multiple-choice Test Items | Sep 1, 2021 | Management, Multiple-choice | Unverified | 0 |
| VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | Feb 14, 2025 | Image Captioning, Large Language Model | Unverified | 0 |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | Nov 23, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| The AI Penalization Effect: People Reduce Compensation for Workers Who Use AI | Jan 22, 2025 | Multiple-choice | Unverified | 0 |
| Perception Test 2023: A Summary of the First Challenge And Outcome | Dec 20, 2023 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | Nov 29, 2024 | Benchmarking, Grounded Video Question Answering | Unverified | 0 |
| A Foundational Multimodal Vision Language AI Assistant for Human Pathology | Dec 13, 2023 | Decision Making, Diagnostic | Unverified | 0 |
| PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian | Feb 11, 2025 | Multiple-choice | Unverified | 0 |
| Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions | Sep 12, 2023 | Multiple-choice, Sentence | Unverified | 0 |
| Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis | Jun 3, 2025 | Multiple-choice | Unverified | 0 |
| PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain | May 30, 2025 | Instruction Following, Multiple-choice | Unverified | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | Oct 14, 2024 | Benchmarking, Management | Unverified | 0 |
| PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models | Jun 21, 2025 | Mathematical Reasoning, Multiple-choice | Unverified | 0 |
| Vision-Language Models Do Not Understand Negation | Jan 16, 2025 | Multiple-choice, Negation | Unverified | 0 |
| Predicting Item Survival for Multiple Choice Questions in a High-Stakes Medical Exam | May 1, 2020 | Information Retrieval, Multiple-choice | Unverified | 0 |
| When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards | Feb 1, 2024 | Answer Selection, Language Modeling | Code Available | 0 |
| Video Prediction via Selective Sampling | Dec 1, 2018 | Multiple-choice, Prediction | Code Available | 0 |
| MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback | Oct 17, 2024 | Fact Verification, Hallucination | Code Available | 0 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Jun 7, 2024 | Multiple-choice, Philosophy | Code Available | 0 |
| Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning | May 30, 2024 | Misconceptions, Multiple-choice | Code Available | 0 |
| Automating Turkish Educational Quiz Generation Using Large Language Models | Jun 5, 2024 | Multiple-choice | Code Available | 0 |
| How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? | Oct 21, 2024 | counterfactual, Decision Making | Code Available | 0 |
| Measuring Agreeableness Bias in Multimodal Models | Aug 17, 2024 | Decision Making, Multiple-choice | Code Available | 0 |
| CSEPrompts: A Benchmark of Introductory Computer Science Prompts | Apr 3, 2024 | Multiple-choice | Code Available | 0 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 |
| MedG-KRP: Medical Graph Knowledge Representation Probing | Dec 14, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| How much do LLMs learn from negative examples? | Mar 18, 2025 | Multiple-choice, Question Answering | Code Available | 0 |
| CNN for Text-Based Multiple Choice Question Answering | Jul 1, 2018 | Multiple-choice, Question Answering | Code Available | 0 |
| DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Nov 14, 2024 | Form, Hallucination | Code Available | 0 |
| Confident Multiple Choice Learning | Jun 12, 2017 | General Classification, image-classification | Code Available | 0 |
| VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models | Mar 10, 2025 | Image Description, Multiple-choice | Code Available | 0 |
| A Simple Method for Commonsense Reasoning | Jun 7, 2018 | Common Sense Reasoning, Coreference Resolution | Code Available | 0 |
| Chance-Constrained Multiple-Choice Knapsack Problem: Model, Algorithms, and Applications | Jun 26, 2023 | Combinatorial Optimization, Multiple-choice | Code Available | 0 |
| Biomedical Entity Linking as Multiple Choice Question Answering | Feb 23, 2024 | Entity Linking, Multiple-choice | Code Available | 0 |
| (WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges | Jan 3, 2025 | Multiple-choice, Question Answering | Code Available | 0 |
| DE-COP: Detecting Copyrighted Content in Language Models Training Data | Feb 15, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Patent Figure Classification using Large Vision-language Models | Jan 22, 2025 | Classification, Few-Shot Learning | Code Available | 0 |