| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | Dec 31, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | Jan 18, 2025 | Multiple-choice, Question Answering | —Unverified | 0 |
| GeoSQA: A Benchmark for Scenario-based Question Answering in the Geography Domain at High School Level | Aug 20, 2019 | General Knowledge, Multiple-choice | —Unverified | 0 |
| FAMULUS: Interactive Annotation and Feedback Generation for Teaching Diagnostic Reasoning | Aug 29, 2019 | Diagnostic, Multiple-choice | —Unverified | 0 |
| Aquila-Med LLM: Pioneering Full-Process Open-Source Medical Language Models | Jun 18, 2024 | Multiple-choice | —Unverified | 0 |
| Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? | Mar 16, 2023 | Multiple-choice | —Unverified | 0 |
| Exposing the Limits of Video-Text Models through Contrast Sets | Jan 16, 2022 | Language Modeling, Language Modelling | —Unverified | 0 |
| FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | Nov 4, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | Apr 20, 2025 | Descriptive, Ethics | —Unverified | 0 |
| Can Crowdsourcing be used for Effective Annotation of Arabic? | May 1, 2014 | Entity Resolution, Multiple-choice | —Unverified | 0 |
| Can ChatGPT pass the Vietnamese National High School Graduation Examination? | Jun 15, 2023 | Language Modeling, Language Modelling | —Unverified | 0 |
| Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments | Nov 28, 2024 | Multiple-choice | —Unverified | 0 |
| Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension | Jan 16, 2020 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | Apr 4, 2025 | Benchmarking, Management | —Unverified | 0 |
| A Joint-Reasoning based Disease Q&A System | Jan 6, 2024 | Knowledge Graphs, Misinformation | —Unverified | 0 |
| Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution | Jun 22, 2023 | Multiple-choice | —Unverified | 0 |
| Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension | May 1, 2022 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 |
| Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension | Jan 16, 2022 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 |
| Bridging the Language Gap: Knowledge Injected Multilingual Question Answering | Apr 6, 2023 | Cross-Lingual Transfer, Extractive Question-Answering | —Unverified | 0 |
| Bridging Information-Seeking Human Gaze and Machine Reading Comprehension | Sep 30, 2020 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 |
| Adapting Vision-Language Models for Evaluating World Models | Jun 22, 2025 | Action Recognition, Multimodal Reasoning | —Unverified | 0 |
| How Additional Knowledge can Improve Natural Language Commonsense Question Answering? | Sep 19, 2019 | Articles, Language Modeling | —Unverified | 0 |
| Fine-tuning BERT with Focus Words for Explanation Regeneration | Dec 1, 2020 | Explanation Generation, Multiple-choice | —Unverified | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | May 17, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | Feb 12, 2025 | Multiple-choice, Survey | —Unverified | 0 |