| Evaluating Nuanced Bias in Large Language Model Free Response Answers | Jul 11, 2024 | Benchmarking, Language Modeling | —Unverified | 0 |
| GANDALF: a General Character Name Description Dataset for Long Fiction | Nov 1, 2021 | Multiple-choice, Question Answering | —Unverified | 0 |
| Evaluating Question Answering Evaluation | Nov 1, 2019 | Answer Generation, Multiple-choice | —Unverified | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | Feb 12, 2025 | Multiple-choice, Survey | —Unverified | 0 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 |
| BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles | Sep 23, 2021 | Multiple-choice, Question Answering | —Unverified | 0 |
| Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions | Nov 5, 2023 | Logical Reasoning, Multiple-choice | —Unverified | 0 |
| GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | Nov 25, 2024 | Medical Visual Question Answering, Multiple-choice | —Unverified | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | Dec 5, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms | Jun 5, 2025 | Multiple-choice | —Unverified | 0 |
| Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | Jun 24, 2024 | Diversity, Multiple-choice | —Unverified | 0 |
| Evaluation of Automatically Generated Pronoun Reference Questions | Sep 1, 2017 | Multiple-choice, Reading Comprehension | —Unverified | 0 |
| Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension | May 1, 2022 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 |
| Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution | Jun 22, 2023 | Multiple-choice | —Unverified | 0 |
| Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil | Aug 9, 2024 | Math, Multiple-choice | —Unverified | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | Apr 4, 2025 | Benchmarking, Management | —Unverified | 0 |
| EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta | Dec 31, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | Jan 1, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| ExplanationLP: Abductive Reasoning for Explainable Science Question Answering | Oct 25, 2020 | Answer Selection, ARC | —Unverified | 0 |
| Can ChatGPT pass the Vietnamese National High School Graduation Examination? | Jun 15, 2023 | Language Modeling, Language Modelling | —Unverified | 0 |
| Answering questions by learning to rank -- Learning to rank by answering questions | Sep 2, 2019 | ARC, Learning-To-Rank | —Unverified | 0 |
| Explore then Determine: A GNN-LLM Synergy Framework for Reasoning over Knowledge Graph | Jun 3, 2024 | Knowledge Graphs, Multiple-choice | —Unverified | 0 |
| Can Crowdsourcing be used for Effective Annotation of Arabic? | May 1, 2014 | Entity Resolution, Multiple-choice | —Unverified | 0 |
| Generalised Winograd Schema and its Contextuality | Aug 31, 2023 | coreference-resolution, Coreference Resolution | —Unverified | 0 |
| Enhancing Multiple-Choice Question Answering with Causal Knowledge | Jun 1, 2021 | Multiple-choice, Question Answering | —Unverified | 0 |