| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLULanguage Model Evaluation | CodeCode Available | 1 |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Jan 17, 2025 | FairnessMultiple-choice | CodeCode Available | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical ReasoningMultiple-choice | CodeCode Available | 1 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Oct 5, 2023 | Common Sense ReasoningMultiple-choice | CodeCode Available | 1 |
| AdaLoGN: Adaptive Logic Graph Network for Reasoning-Based Machine Reading Comprehension | Mar 16, 2022 | Logical ReasoningMachine Reading Comprehension | CodeCode Available | 1 |
| Evaluating the Knowledge Dependency of Questions | Nov 21, 2022 | Multiple-choice | CodeCode Available | 1 |
| All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Nov 25, 2024 | AllLong Question Answer | CodeCode Available | 1 |
| From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap | Apr 13, 2020 | Dialogue State TrackingMachine Reading Comprehension | CodeCode Available | 1 |
| Generating Distractors for Reading Comprehension Questions from Real Examinations | Sep 8, 2018 | DecoderDistractor Generation | CodeCode Available | 1 |
| FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue | May 12, 2022 | Dialogue UnderstandingDomain Adaptation | CodeCode Available | 1 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | CodeCode Available | 1 |
| LifeQA: A Real-life Dataset for Video Question Answering | May 1, 2020 | Multiple-choiceQuestion Answering | CodeCode Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | DiagnosticEgoSchema | CodeCode Available | 1 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | EthicsMultiple-choice | CodeCode Available | 1 |
| Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Jul 24, 2023 | Contrastive LearningMultimodal Reasoning | CodeCode Available | 1 |
| Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Apr 30, 2024 | ImplicaturesMultiple-choice | CodeCode Available | 1 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choiceVideo Understanding | CodeCode Available | 1 |
| INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | Jun 13, 2024 | Multiple-choiceVisual Reasoning | CodeCode Available | 1 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Oct 12, 2020 | Distractor GenerationMultiple-choice | CodeCode Available | 1 |
| JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Oct 16, 2023 | Domain AdaptationMedical Question Answering | CodeCode Available | 1 |
| Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | May 7, 2023 | Multiple-choice | CodeCode Available | 1 |
| Assessing the Chemical Intelligence of Large Language Models | May 12, 2025 | Multiple-choice | CodeCode Available | 1 |
| EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Oct 12, 2022 | Distractor GenerationMultiple-choice | CodeCode Available | 1 |
| Leaf: Multiple-Choice Question Generation | Jan 22, 2022 | Multiple-choiceQuestion Answering | CodeCode Available | 1 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Aug 23, 2024 | DisentanglementKnowledge Tracing | CodeCode Available | 1 |