| A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation using GPT | Jan 13, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 |
| Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions | Oct 3, 2023 | Misconceptions, Multiple-choice | Code Available | 0 |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Jun 13, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind | May 24, 2023 | Multiple-choice, Question Answering | Code Available | 0 |
| CLOMO: Counterfactual Logical Modification with Large Language Models | Nov 29, 2023 | counterfactual, Counterfactual Reasoning | Code Available | 0 |
| IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs | Nov 12, 2024 | coreference-resolution, Coreference Resolution | Code Available | 0 |
| DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Jun 18, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security | Dec 26, 2023 | Computer Security, Multiple-choice | Code Available | 0 |
| What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks? | Jun 1, 2021 | Multiple-choice, Natural Language Understanding | Code Available | 0 |
| Assessing the Quality of Multiple-Choice Questions Using GPT-4 and Rule-Based Methods | Jul 16, 2023 | Multiple-choice | Code Available | 0 |
| TAXI: Evaluating Categorical Knowledge Editing for Language Models | Apr 23, 2024 | knowledge editing, Multiple-choice | Code Available | 0 |
| WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Feb 25, 2025 | MMLU, Multiple-choice | Code Available | 0 |
| What Makes Reading Comprehension Questions Easier? | Aug 28, 2018 | Machine Reading Comprehension, Multiple-choice | Code Available | 0 |
| Downstream Trade-offs of a Family of Text Watermarks | Nov 16, 2023 | Form, Language Modelling | Code Available | 0 |
| Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches | May 18, 2025 | Fairness, Memorization | Code Available | 0 |
| A multimodal dataset for understanding the impact of mobile phones on remote online virtual education | Dec 13, 2024 | EEG, Head Pose Estimation | Code Available | 0 |
| Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning | Oct 9, 2024 | Hallucination, Multiple-choice | Code Available | 0 |
| Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty? | Jul 7, 2024 | Multiple-choice | Code Available | 0 |
| Differentiating Choices via Commonality for Multiple-Choice Question Answering | Aug 21, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 |
| Utilizing Background Knowledge for Robust Reasoning over Traffic Situations | Dec 4, 2022 | Knowledge Graphs, Multiple-choice | Code Available | 0 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Oct 15, 2024 | Image Description, Multiple-choice | Code Available | 0 |
| Improving Machine Reading Comprehension with General Reading Strategies | Oct 31, 2018 | ARC, Language Modeling | Code Available | 0 |
| A large language model-assisted education tool to provide feedback on open-ended responses | Jul 25, 2023 | Language Modeling, Language Modelling | Code Available | 0 |
| DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking | Sep 26, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 |
| Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction | Mar 5, 2025 | In-Context Learning, Multiple-choice | Code Available | 0 |