| Predicting the Difficulty of Multiple Choice Questions in a High-stakes Medical Exam | Aug 1, 2019 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | Mar 1, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability | Nov 10, 2024 | Multiple-choice, Text Generation | —Unverified | 0 | 0 |
| Prompt Engineering and Calibration for Zero-Shot Commonsense Reasoning | Apr 14, 2023 | Multiple-choice, Prompt Engineering | —Unverified | 0 | 0 |
| Prompting Implicit Discourse Relation Annotation | Feb 7, 2024 | Classification, Implicit Discourse Relation Classification | —Unverified | 0 | 0 |
| Instruction Fine-Tuning: Does Prompt Loss Matter? | Jan 24, 2024 | Multiple-choice, token-classification | —Unverified | 0 | 0 |
| ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding | Nov 7, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology | Nov 16, 2023 | MMLU, Multiple-choice | —Unverified | 0 | 0 |
| PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities | Jan 13, 2024 | Instruction Following, Multiple-choice | —Unverified | 0 | 0 |
| Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | Sep 30, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs | Jan 1, 2025 | Multiple-choice, Video Generation | —Unverified | 0 | 0 |
| QOG: Question and Options Generation based on Language Model | Jun 18, 2024 | Information Retrieval, Language Modeling | —Unverified | 0 | 0 |
| QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism | Jun 19, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| VisNumBench: Evaluating Number Sense of Multimodal Large Language Models | Mar 19, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Query Rewriting for Retrieval-Augmented Large Language Models | May 23, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Question Difficulty Ranking for Multiple-Choice Reading Comprehension | Apr 16, 2024 | Multiple-choice, Reading Comprehension | —Unverified | 0 | 0 |
| Question-type Identification for Academic Questions in Online Learning Platform | Nov 24, 2022 | Binary Classification, Multiple-choice | —Unverified | 0 | 0 |
| Visual7W: Grounded Question Answering in Images | Nov 11, 2015 | Multiple-choice, Multiple Choice Question Answering (MCQA) | —Unverified | 0 | 0 |
| Ranking Facts for Explaining Answers to Elementary Science Questions | Oct 18, 2021 | Interpretable Machine Learning, Learning-To-Rank | —Unverified | 0 | 0 |
| Ranking Large Language Models without Ground Truth | Feb 21, 2024 | Multiple-choice, Triplet | —Unverified | 0 | 0 |
| Read, Retrospect, Select: An MRC Framework to Short Text Entity Linking | Jan 7, 2021 | Entity Linking, Machine Reading Comprehension | —Unverified | 0 | 0 |
| RECAP-KG: Mining Knowledge Graphs from Raw GP Notes for Remote COVID-19 Assessment in Primary Care | Jun 17, 2023 | Decision Making, graph construction | —Unverified | 0 | 0 |
| Receptivity of an AI Cognitive Assistant by the Radiology Community: A Report on Data Collected at RSNA | Sep 13, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Recurrent and Contextual Models for Visual Question Answering | Mar 23, 2017 | Diversity, Multiple-choice | —Unverified | 0 | 0 |
| Visual Madlibs: Fill in the Blank Description Generation and Question Answering | Dec 1, 2015 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Rethinking AI Cultural Alignment | Jan 13, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | Mar 12, 2024 | Language Model Evaluation, Language Modeling | —Unverified | 0 | 0 |
| Reusing Swedish FrameNet for training semantic roles | May 1, 2014 | Multiple-choice | —Unverified | 0 | 0 |
| Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | Feb 25, 2025 | Inductive Bias, Logical Reasoning | —Unverified | 0 | 0 |
| RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge | Jan 2, 2021 | counterfactual, Counterfactual Reasoning | —Unverified | 0 | 0 |
| RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation | Sep 24, 2024 | Multiple-choice, Sentence | —Unverified | 0 | 0 |
| R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest | Oct 27, 2024 | Medical Visual Question Answering, Multiple-choice | —Unverified | 0 | 0 |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | May 21, 2025 | Dataset Generation, Descriptive | —Unverified | 0 | 0 |
| Robust portfolio optimization model for electronic coupon allocation | May 21, 2024 | Multiple-choice, Portfolio Optimization | —Unverified | 0 | 0 |
| Visual Madlibs: Fill in the blank Image Generation and Question Answering | May 31, 2015 | Image Generation, Multiple-choice | —Unverified | 0 | 0 |
| SafePath: Conformal Prediction for Safe LLM-Based Autonomous Navigation | May 14, 2025 | Autonomous Driving, Autonomous Navigation | —Unverified | 0 | 0 |
| Adversarial Training for Machine Reading Comprehension with Virtual Embeddings | Jun 8, 2021 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | Nov 25, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Visual Question Answering as Reading Comprehension | Nov 29, 2018 | Common Sense Reasoning, General Knowledge | —Unverified | 0 | 0 |
| Adversarial Databases Improve Success in Retrieval-based Large Language Models | Jul 19, 2024 | Multiple-choice, RAG | —Unverified | 0 | 0 |
| SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search | Jan 7, 2022 | Information Retrieval, Multiple-choice | —Unverified | 0 | 0 |
| Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | Oct 10, 2024 | Conformal Prediction, Language Modeling | —Unverified | 0 | 0 |
| SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning | Apr 22, 2025 | Multiple-choice, reinforcement-learning | —Unverified | 0 | 0 |
| SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia | Mar 21, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models | Feb 12, 2025 | Fairness, Multiple-choice | —Unverified | 0 | 0 |
| SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | Feb 6, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Scene Restoring for Narrative Machine Reading Comprehension | Nov 1, 2020 | Cloze Test, Machine Reading Comprehension | —Unverified | 0 | 0 |
| Scheduling Algorithms for Federated Learning with Minimal Energy Consumption | Sep 13, 2022 | Federated Learning, Multiple-choice | —Unverified | 0 | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | Feb 19, 2025 | Benchmarking, Diversity | —Unverified | 0 | 0 |
| GeoSQA: A Benchmark for Scenario-based Question Answering in the Geography Domain at High School Level | Aug 20, 2019 | General Knowledge, Multiple-choice | —Unverified | 0 | 0 |