| Bridging Information-Seeking Human Gaze and Machine Reading Comprehension | Sep 30, 2020 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 |
| Adapting Vision-Language Models for Evaluating World Models | Jun 22, 2025 | Action Recognition, Multimodal Reasoning | —Unverified | 0 |
| From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | Apr 4, 2025 | Multiple-choice | —Unverified | 0 |
| GANDALF: a General Character Name Description Dataset for Long Fiction | Nov 1, 2021 | Multiple-choice, Question Answering | —Unverified | 0 |
| Generating Diagnostic Multiple Choice Comprehension Cloze Questions | Jun 1, 2012 | Diagnostic, Multiple-choice | —Unverified | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | Feb 12, 2025 | Multiple-choice, Survey | —Unverified | 0 |
| AI-based Arabic Language and Speech Tutor | Oct 22, 2022 | Multiple-choice, Self-Learning | —Unverified | 0 |
| Answering Science Exam Questions Using Query Reformulation with Background Knowledge | Nov 17, 2018 | ARC, Information Retrieval | —Unverified | 0 |
| ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition | Oct 8, 2024 | Action Recognition, Multiple-choice | —Unverified | 0 |
| Answering Science Exam Questions Using Query Rewriting with Background Knowledge | Sep 15, 2018 | ARC, Information Retrieval | —Unverified | 0 |
| BloomVQA: Assessing Hierarchical Multi-modal Comprehension | Dec 20, 2023 | Data Augmentation, Memorization | —Unverified | 0 |
| AI and Machine Learning for Next Generation Science Assessments | Apr 23, 2024 | Multiple-choice | —Unverified | 0 |
| Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | Jun 8, 2025 | Multiple-choice | —Unverified | 0 |
| Answering Questions in Stages: Prompt Chaining for Contract QA | Oct 9, 2024 | Multiple-choice | —Unverified | 0 |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | Apr 18, 2024 | Depth Estimation, Multiple-choice | —Unverified | 0 |
| ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising | Dec 9, 2024 | Multiple-choice, Multi-Task Learning | —Unverified | 0 |
| Answering questions by learning to rank - Learning to rank by answering questions | Nov 1, 2019 | ARC, Learning-To-Rank | —Unverified | 0 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 |
| BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles | Sep 23, 2021 | Multiple-choice, Question Answering | —Unverified | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | Dec 5, 2024 | Language Modeling, Language Modelling | —Unverified | 0 |
| Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | Jan 28, 2024 | Knowledge Graphs, Medical Diagnosis | —Unverified | 0 |
| Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset | Nov 14, 2023 | Answer Selection, Information Retrieval | —Unverified | 0 |
| Evaluating Machine Reading Systems through Comprehension Tests | May 1, 2012 | Answer Selection, Multiple-choice | —Unverified | 0 |
| EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta | Dec 31, 2024 | Multiple-choice, Question Answering | —Unverified | 0 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | Jan 1, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |