| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | EthicsMultiple-choice | CodeCode Available | 1 | 5 |
| BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Jun 14, 2024 | Multiple-choice | CodeCode Available | 1 | 5 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge GraphsMultiple-choice | CodeCode Available | 1 | 5 |
| Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | May 7, 2023 | Multiple-choice | CodeCode Available | 1 | 5 |
| Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Jul 24, 2023 | Contrastive LearningMultimodal Reasoning | CodeCode Available | 1 | 5 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Aug 23, 2024 | DisentanglementKnowledge Tracing | CodeCode Available | 1 | 5 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 | 5 |
| BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Oct 8, 2023 | Distractor GenerationLanguage Modelling | CodeCode Available | 1 | 5 |
| ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Apr 7, 2025 | Chart Question AnsweringChart Understanding | CodeCode Available | 1 | 5 |
| IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages | Nov 8, 2020 | Genre classificationMultiple-choice | CodeCode Available | 1 | 5 |