| Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions | Oct 3, 2023 | Misconceptions, Multiple-choice | Code Available | 0 | 5 |
| KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting | Dec 1, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | Code Available | 0 | 5 |
| Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Jul 12, 2024 | Logical Reasoning, Multiple-choice | Code Available | 0 | 5 |
| DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking | Sep 26, 2024 | Distractor Generation, Multiple-choice | Code Available | 0 | 5 |
| Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty? | Jul 7, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings | Jan 15, 2024 | Knowledge Graph Embeddings, Knowledge Graphs | Code Available | 0 | 5 |
| LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Oct 13, 2024 | Hallucination, Hallucination Evaluation | Code Available | 0 | 5 |
| Automating Turkish Educational Quiz Generation Using Large Language Models | Jun 5, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Oct 15, 2024 | Image Description, Multiple-choice | Code Available | 0 | 5 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Mar 17, 2025 | Anatomy, Large Language Model | Code Available | 0 | 5 |
| Differentiating Choices via Commonality for Multiple-Choice Question Answering | Aug 21, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | Code Available | 0 | 5 |
| A large language model-assisted education tool to provide feedback on open-ended responses | Jul 25, 2023 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Oct 15, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Length Optimization in Conformal Prediction | Jun 27, 2024 | Conformal Prediction, Language Modeling | Code Available | 0 | 5 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | Misconceptions, Multiple-choice | Code Available | 0 | 5 |
| CASE: Commonsense-Augmented Score with an Expanded Answer Space | Nov 3, 2023 | Multiple-choice | Code Available | 0 | 5 |
| Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Open-domain Question Answering | Aug 28, 2018 | AI2 Reasoning Challenge, ARC | Code Available | 0 | 5 |
| Learning to Reuse Distractors to support Multiple Choice Question Generation in Education | Oct 25, 2022 | Multiple-choice, Question Generation | Code Available | 0 | 5 |
| Language Models as Knowledge Bases for Visual Word Sense Disambiguation | Oct 3, 2023 | Image Captioning, Multiple-choice | Code Available | 0 | 5 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 | 5 |
| KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models | Oct 15, 2023 | Multiple-choice, Triplet | Code Available | 0 | 5 |
| Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs | Jan 10, 2025 | Multiple-choice | Code Available | 0 | 5 |
| Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models | Apr 11, 2024 | Multiple-choice, Reading Comprehension | Code Available | 0 | 5 |
| Joint Learning of Sentence Embeddings for Relevance and Entailment | May 16, 2016 | Decision Making, Information Retrieval | Code Available | 0 | 5 |
| DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Jun 18, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| Are Large Language Models Consistent over Value-laden Questions? | Jul 3, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Feb 20, 2024 | Language Modelling, Large Language Model | Code Available | 0 | 5 |
| It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning | Nov 13, 2023 | Multiple-choice | Code Available | 0 | 5 |
| Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Jan 10, 2025 | Automatic Speech Recognition, Classification | Code Available | 0 | 5 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Apr 9, 2025 | Multiple-choice | Code Available | 0 | 5 |
| HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Feb 9, 2025 | Answer Generation, Language Modeling | Code Available | 0 | 5 |
| IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Jun 18, 2024 | Management, Multiple-choice | Code Available | 0 | 5 |
| Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning | Feb 8, 2025 | Legal Reasoning, Multiple-choice | Code Available | 0 | 5 |
| iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers | May 25, 2024 | Common Sense Reasoning, Multiple-choice | Code Available | 0 | 5 |
| DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation | Jun 13, 2024 | Benchmarking, Hallucination | Code Available | 0 | 5 |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | May 6, 2024 | Decision Making, Multiple-choice | Code Available | 0 | 5 |
| StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding | Oct 19, 2023 | Multiple-choice, Natural Language Understanding | Code Available | 0 | 5 |
| Introducing a framework to assess newly created questions with Natural Language Processing | Apr 28, 2020 | Multiple-choice | Code Available | 0 | 5 |
| DE-COP: Detecting Copyrighted Content in Language Models Training Data | Feb 15, 2024 | Language Modeling, Language Modelling | Code Available | 0 | 5 |
| An Automatic Question Usability Evaluation Toolkit | May 30, 2024 | Multiple-choice, Word Embeddings | Code Available | 0 | 5 |
| Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales | Oct 2, 2024 | Multiple-choice | Code Available | 0 | 5 |
| Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? | Jul 2, 2024 | Graph Mining, Language Modeling | Code Available | 0 | 5 |
| A Profit-Maximizing Strategy for Advertising on the e-Commerce Platforms | Oct 31, 2022 | Management, Multiple-choice | Code Available | 0 | 5 |
| Fusing Models with Complementary Expertise | Oct 2, 2023 | Multiple-choice, text-classification | Code Available | 0 | 5 |
| TAXI: Evaluating Categorical Knowledge Editing for Language Models | Apr 23, 2024 | knowledge editing, Multiple-choice | Code Available | 0 | 5 |
| Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions | May 30, 2024 | Language Modelling, Large Language Model | Code Available | 0 | 5 |
| Chance-Constrained Multiple-Choice Knapsack Problem: Model, Algorithms, and Applications | Jun 26, 2023 | Combinatorial Optimization, Multiple-choice | Code Available | 0 | 5 |
| Improving Question Answering with External Knowledge | Feb 3, 2019 | ARC, Multiple-choice | Code Available | 0 | 5 |
| DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Nov 14, 2024 | Form, Hallucination | Code Available | 0 | 5 |