| Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Oct 15, 2024 | Multiple-choice | Code Available | 0 |
| Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing | Oct 14, 2024 | All, Binary Classification | Unverified | 0 |
| Personalised Feedback Framework for Online Education Programmes Using Generative AI | Oct 14, 2024 | Benchmarking, Management | Unverified | 0 |
| MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Oct 14, 2024 | Multiple-choice | Code Available | 1 |
| LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models | Oct 13, 2024 | Multiple-choice | Unverified | 0 |
| LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Oct 13, 2024 | Hallucination, Hallucination Evaluation | Code Available | 0 |
| Taming Overconfidence in LLMs: Reward Calibration in RLHF | Oct 13, 2024 | Multiple-choice | Code Available | 1 |
| The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models | Oct 12, 2024 | Misconceptions, Multiple-choice | Unverified | 0 |
| SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models | Oct 11, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 1 |
| NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models | Oct 11, 2024 | Multiple-choice, TruthfulQA | Code Available | 0 |