| Predicting the Difficulty of Multiple Choice Questions in a High-stakes Medical Exam | Aug 1, 2019 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods | Mar 1, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability | Nov 10, 2024 | Multiple-choice, Text Generation | —Unverified | 0 | 0 |
| Prompt Engineering and Calibration for Zero-Shot Commonsense Reasoning | Apr 14, 2023 | Multiple-choice, Prompt Engineering | —Unverified | 0 | 0 |
| Prompting Implicit Discourse Relation Annotation | Feb 7, 2024 | Classification, Implicit Discourse Relation Classification | —Unverified | 0 | 0 |
| Instruction Fine-Tuning: Does Prompt Loss Matter? | Jan 24, 2024 | Multiple-choice, token-classification | —Unverified | 0 | 0 |
| ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding | Nov 7, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology | Nov 16, 2023 | MMLU, Multiple-choice | —Unverified | 0 | 0 |
| PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities | Jan 13, 2024 | Instruction Following, Multiple-choice | —Unverified | 0 | 0 |
| Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | Sep 30, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |