| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | May 2, 2025 | Multiple-choice | —Unverified | 0 |
| Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | Nov 9, 2023 | Multiple-choice, World Knowledge | —Unverified | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | Jun 27, 2024 | MMLU, Multiple-choice | —Unverified | 0 |
| Evaluating Question Answering Evaluation | Nov 1, 2019 | Answer Generation, Multiple-choice | —Unverified | 0 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | May 15, 2025 | Informativeness, Multiple-choice | —Unverified | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | Hallucination, Multiple-choice | —Unverified | 0 |
| Adaptive Crowdsourcing Algorithms for the Bandit Survey Problem | Feb 13, 2013 | Information Retrieval, Multiple-choice | —Unverified | 0 |
| CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | Jul 2, 2024 | Multiple-choice | —Unverified | 0 |
| Evaluating multiple large language models in pediatric ophthalmology | Nov 7, 2023 | Multiple-choice | —Unverified | 0 |
| CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | Oct 17, 2024 | Multiple-choice, Response Generation | —Unverified | 0 |