| Improving Question Answering with External Knowledge | Feb 3, 2019 | ARCMultiple-choice | CodeCode Available | 0 | 5 |
| Introducing a framework to assess newly created questions with Natural Language Processing | Apr 28, 2020 | Multiple-choice | CodeCode Available | 0 | 5 |
| CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models | Jun 7, 2024 | Multiple-choicePhilosophy | CodeCode Available | 0 | 5 |
| Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning | Oct 9, 2024 | HallucinationMultiple-choice | CodeCode Available | 0 | 5 |
| How much do LLMs learn from negative examples? | Mar 18, 2025 | Multiple-choiceQuestion Answering | CodeCode Available | 0 | 5 |
| Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy | May 24, 2023 | In-Context LearningMultiple-choice | CodeCode Available | 0 | 5 |
| How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? | Oct 21, 2024 | counterfactualDecision Making | CodeCode Available | 0 | 5 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | May 13, 2025 | FormMultiple-choice | CodeCode Available | 0 | 5 |
| IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs | Nov 12, 2024 | coreference-resolutionCoreference Resolution | CodeCode Available | 0 | 5 |
| Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors | May 2, 2025 | High School PhysicsMisconceptions | CodeCode Available | 0 | 5 |