| Large Language Models Are Not Robust Multiple Choice Selectors | Sep 7, 2023 | Computational Efficiency, Multiple-choice | Code Available | 1 |
| An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models | Sep 5, 2023 | Multiple-choice | Unverified | 0 |
| CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Sep 5, 2023 | Code Generation, Multiple-choice | Code Available | 1 |
| INCEPTNET: Precise And Early Disease Detection Application For Medical Images Analyses | Sep 5, 2023 | Cell Detection, Lesion Segmentation | Code Available | 0 |
| Generalised Winograd Schema and its Contextuality | Aug 31, 2023 | Coreference Resolution | Unverified | 0 |
| The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants | Aug 31, 2023 | Belebele, Cross-Lingual Transfer | Code Available | 2 |
| Spoken Language Intelligence of Large Language Models for Language Learning | Aug 28, 2023 | Language Acquisition, Multiple-choice | Code Available | 0 |
| Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions | Aug 22, 2023 | Multiple-choice, Sensitivity | Unverified | 0 |
| LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Aug 20, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Aug 19, 2023 | Multiple-choice | Code Available | 2 |