| A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Jun 8, 2024 | Descriptive, Language Modelling | Code Available | 1 |
| LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs | Jun 7, 2024 | Mathematical Reasoning, Multiple-choice | Code Available | 0 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Jun 7, 2024 | Multiple-choice, Philosophy | Code Available | 0 |
| M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering | Jun 6, 2024 | Abstractive Question Answering, Clinical Knowledge | Code Available | 0 |
| Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? | Jun 6, 2024 | Multiple-choice, Question Answering | Unverified | 0 |
| Every Answer Matters: Evaluating Commonsense with Probabilistic Measures | Jun 6, 2024 | Common Sense Reasoning, Language Modeling | Code Available | 0 |
| Automating Turkish Educational Quiz Generation Using Large Language Models | Jun 5, 2024 | Multiple-choice | Code Available | 0 |
| Order-Independence Without Fine Tuning | Jun 4, 2024 | Language Modelling, Multiple-choice | Code Available | 0 |
| TopViewRS: Vision-Language Models as Top-View Spatial Reasoners | Jun 4, 2024 | Multiple-choice, Spatial Reasoning | Code Available | 1 |
| Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data | Jun 4, 2024 | Clinical Knowledge, Multiple-choice | Code Available | 0 |