| Can large language models reason about medical questions? | Jul 17, 2022 | MedQAMultiple-choice | CodeCode Available | 1 | 5 |
| Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | Aug 18, 2023 | Multiple-choiceQuestion Answering | CodeCode Available | 1 | 5 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLULanguage Model Evaluation | CodeCode Available | 1 | 5 |
| FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue | May 12, 2022 | Dialogue UnderstandingDomain Adaptation | CodeCode Available | 1 | 5 |
| Option Tracing: Beyond Correctness Analysis in Knowledge Tracing | Apr 19, 2021 | Knowledge TracingMultiple-choice | CodeCode Available | 1 | 5 |
| ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Jul 8, 2024 | Anomaly DetectionCode Generation | CodeCode Available | 1 | 5 |
| AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Feb 24, 2025 | Logical ReasoningMultiple-choice | CodeCode Available | 1 | 5 |
| CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Sep 5, 2023 | Code GenerationMultiple-choice | CodeCode Available | 1 | 5 |
| JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Oct 16, 2023 | Domain AdaptationMedical Question Answering | CodeCode Available | 1 | 5 |
| Large Language Models Encode Clinical Knowledge | Dec 26, 2022 | Clinical KnowledgeMedQA | CodeCode Available | 1 | 5 |