| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLULanguage Model Evaluation | CodeCode Available | 1 |
| The Effect of Sampling Temperature on Problem Solving in Large Language Models | Feb 7, 2024 | Multiple-choicePrompt Engineering | CodeCode Available | 1 |
| SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Feb 6, 2024 | AttributeFace Anti-Spoofing | CodeCode Available | 1 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | EthicsMultiple-choice | CodeCode Available | 1 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choicePosition | CodeCode Available | 1 |
| LongHealth: A Question Answering Benchmark with Long Clinical Documents | Jan 25, 2024 | Information RetrievalMultiple-choice | CodeCode Available | 1 |
| The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models | Jan 11, 2024 | MathMultiple-choice | CodeCode Available | 1 |
| HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Dec 26, 2023 | DiversityKnowledge Graphs | CodeCode Available | 1 |
| RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models | Dec 26, 2023 | MemorizationMultiple-choice | CodeCode Available | 1 |
| An In-depth Look at Gemini's Language Abilities | Dec 18, 2023 | Instruction FollowingMath | CodeCode Available | 1 |