| Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge | Mar 14, 2024 | Multiple-choice | Unverified | 0 |
| AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic | Mar 14, 2024 | Ethics, Multiple-choice | Unverified | 0 |
| Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | Mar 12, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding | Mar 11, 2024 | Dialogue Generation, Multiple-choice | Unverified | 0 |
| Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5 | Mar 4, 2024 | Multiple-choice, Part-Of-Speech Tagging | Unverified | 0 |
| An Improved Traditional Chinese Evaluation Suite for Foundation Model | Mar 4, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | Mar 3, 2024 | MedQA, MMLU | Unverified | 0 |
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | Mar 3, 2024 | Cloze Test, Multiple-choice | Unverified | 0 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | Mar 1, 2024 | Multiple-choice | Unverified | 0 |
| Unsupervised multiple choices question answering via universal corpus | Feb 27, 2024 | Form, Knowledge Graphs | Unverified | 0 |