| ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Apr 15, 2021 | Graph GenerationMultiple-choice | CodeCode Available | 1 |
| AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Dec 3, 2024 | Multiple-choice | CodeCode Available | 1 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Oct 5, 2023 | Common Sense ReasoningMultiple-choice | CodeCode Available | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical ReasoningMultiple-choice | CodeCode Available | 1 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | BenchmarkingClassification | CodeCode Available | 1 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | BenchmarkingMultiple-choice | CodeCode Available | 1 |
| An MRC Framework for Semantic Role Labeling | Sep 14, 2021 | Computational EfficiencyMachine Reading Comprehension | CodeCode Available | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | EthicsMultiple-choice | CodeCode Available | 1 |
| EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Oct 12, 2022 | Distractor GenerationMultiple-choice | CodeCode Available | 1 |