| Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty? | Jul 7, 2024 | Multiple-choice | Code Available | 0 |
| Are Large Language Models Consistent over Value-laden Questions? | Jul 3, 2024 | Multiple-choice | Code Available | 0 |
| Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? | Jul 2, 2024 | Graph Mining, Language Modeling | Code Available | 0 |
| CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | Jul 2, 2024 | Multiple-choice | Unverified | 0 |
| DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions | Jun 27, 2024 | Distractor Generation, Math | Code Available | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | Jun 27, 2024 | MMLU, Multiple-choice | Unverified | 0 |
| Length Optimization in Conformal Prediction | Jun 27, 2024 | Conformal Prediction, Language Modeling | Code Available | 0 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Jun 25, 2024 | ARC, Benchmarking | Code Available | 0 |
| Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration | Jun 24, 2024 | Diversity, Multiple-choice | Unverified | 0 |
| SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages | Jun 20, 2024 | Language Modelling, Large Language Model | Unverified | 0 |