| Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge | Mar 14, 2024 | Multiple-choice | Unverified | 0 |
| Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | Mar 12, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge Graphs, Multiple-choice | Code Available | 1 |
| MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding | Mar 11, 2024 | Dialogue Generation, Multiple-choice | Unverified | 0 |
| Unfamiliar Finetuning Examples Control How Language Models Hallucinate | Mar 8, 2024 | MMLU, Multiple-choice | Code Available | 1 |
| The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Mar 5, 2024 | Multiple-choice | Code Available | 4 |
| An Improved Traditional Chinese Evaluation Suite for Foundation Model | Mar 4, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5 | Mar 4, 2024 | Multiple-choice, Part-Of-Speech Tagging | Unverified | 0 |
| To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering | Mar 4, 2024 | MedQA, MMLU | Code Available | 1 |
| KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | Mar 3, 2024 | MedQA, MMLU | Unverified | 0 |