| LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models | Mar 19, 2024 | Multiple-choice | Unverified | 0 |
| Enhancing Event Causality Identification with Rationale and Structure-Aware Causal Question Answering | Mar 17, 2024 | Event Causality Identification, Multiple-choice | Unverified | 0 |
| Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models | Mar 15, 2024 | Few-Shot Image Classification, image-classification | Unverified | 0 |
| EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models | Mar 15, 2024 | Miscellaneous, Multiple-choice | Code Available | 0 |
| Towards Diverse Perspective Learning with Selection over Multiple Temporal Poolings | Mar 14, 2024 | Multiple-choice, Time Series | Code Available | 0 |
| Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge | Mar 14, 2024 | Multiple-choice | Unverified | 0 |
| AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic | Mar 14, 2024 | Ethics, Multiple-choice | Unverified | 0 |
| Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | Mar 12, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding | Mar 11, 2024 | Dialogue Generation, Multiple-choice | Unverified | 0 |
| Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5 | Mar 4, 2024 | Multiple-choice, Part-Of-Speech Tagging | Unverified | 0 |
| An Improved Traditional Chinese Evaluation Suite for Foundation Model | Mar 4, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations | Mar 3, 2024 | MedQA, MMLU | Unverified | 0 |
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | Mar 3, 2024 | Cloze Test, Multiple-choice | Unverified | 0 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | Mar 1, 2024 | Multiple-choice | Unverified | 0 |
| Unsupervised multiple choices question answering via universal corpus | Feb 27, 2024 | Form, Knowledge Graphs | Unverified | 0 |
| Biomedical Entity Linking as Multiple Choice Question Answering | Feb 23, 2024 | Entity Linking, Multiple-choice | Code Available | 0 |
| "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Feb 22, 2024 | Multiple-choice, Text Generation | Code Available | 0 |
| Identifying Multiple Personalities in Large Language Models with External Evaluation | Feb 22, 2024 | Multiple-choice | Unverified | 0 |
| Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models | Feb 21, 2024 | Multiple-choice | Unverified | 0 |
| Ranking Large Language Models without Ground Truth | Feb 21, 2024 | Multiple-choice, Triplet | Unverified | 0 |
| KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge | Feb 21, 2024 | 4k, Multiple-choice | Unverified | 0 |
| Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Feb 20, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities | Feb 20, 2024 | Multiple-choice, Text Simplification | Unverified | 0 |
| Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? | Feb 19, 2024 | Decision Making, Memorization | Code Available | 0 |
| Stick to your Role! Stability of Personal Values Expressed in Large Language Models | Feb 19, 2024 | Multiple-choice | Unverified | 0 |