| Automating question generation from educational text | Sep 26, 2023 | Multiple-choice, Question Generation | Unverified | 0 |
| HANS, are you clever? Clever Hans Effect Analysis of Neural Systems | Sep 21, 2023 | Decision Making, Multiple-choice | Unverified | 0 |
| Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models | Sep 19, 2023 | Explanation Generation, Language Modelling | Code Available | 0 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change | Sep 19, 2023 | Generative Question Answering, Information Retrieval | Unverified | 0 |
| Language models are susceptible to incorrect patient self-diagnosis in medical applications | Sep 17, 2023 | Diagnostic, Multiple-choice | Unverified | 0 |
| Self-Assessment Tests are Unreliable Measures of LLM Personality | Sep 15, 2023 | Multiple-choice | Unverified | 0 |
| SafetyBench: Evaluating the Safety of Large Language Models | Sep 13, 2023 | Multiple-choice | Code Available | 2 |
| Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions | Sep 12, 2023 | Multiple-choice, Sentence | Unverified | 0 |
| Use neural networks to recognize students' handwritten letters and incorrect symbols | Sep 12, 2023 | Multiple-choice | Unverified | 0 |