| SocialIQA: Commonsense Reasoning about Social Interactions | Apr 22, 2019 | Common Sense Reasoning, Coreference Resolution | Code Available | 0 |
| Questioning the Survey Responses of Large Language Models | Jun 13, 2023 | Multiple-choice, Survey | Code Available | 0 |
| Exposing the Limits of Video-Text Models through Contrast Sets | Jul 1, 2022 | Language Modeling, Language Modelling | Code Available | 0 |
| Extracting Keywords from Open-Ended Business Survey Questions | Aug 31, 2018 | Multiple-choice, Survey | Code Available | 0 |
| Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | Feb 16, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Open-domain Question Answering | Aug 28, 2018 | AI2 Reasoning Challenge, ARC | Code Available | 0 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Dec 8, 2024 | Misconceptions, Multiple-choice | Code Available | 0 |
| Learning to Reuse Distractors to support Multiple Choice Question Generation in Education | Oct 25, 2022 | Multiple-choice, Question Generation | Code Available | 0 |
| "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models | Feb 22, 2024 | Multiple-choice, Text Generation | Code Available | 0 |
| Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A | Feb 20, 2024 | Language Modelling, Large Language Model | Code Available | 0 |
| FAT ALBERT: Finding Answers in Large Texts using Semantic Similarity Attention Layer based on BERT | Aug 22, 2020 | Multiple-choice, Question Answering | Code Available | 0 |
| Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Oct 15, 2024 | Multiple-choice | Code Available | 0 |
| TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Nov 13, 2024 | Decision Making, Missing Values | Code Available | 0 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Mar 17, 2025 | Anatomy, Large Language Model | Code Available | 0 |
| Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Feb 22, 2025 | Distractor Generation, Information Retrieval | Code Available | 0 |
| Length Optimization in Conformal Prediction | Jun 27, 2024 | Conformal Prediction, Language Modeling | Code Available | 0 |
| FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework | Apr 9, 2021 | Language Modelling, Multiple-choice | Code Available | 0 |
| Training-free LLM Merging for Multi-task Learning | Jun 14, 2025 | Multiple-choice, Multi-Task Learning | Code Available | 0 |
| Solving and Generating NPR Sunday Puzzles with Large Language Models | Jun 21, 2023 | Multiple-choice, Prompt Engineering | Code Available | 0 |
| HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Feb 9, 2025 | Answer Generation, Language Modeling | Code Available | 0 |
| Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings | Jan 15, 2024 | Knowledge Graph Embeddings, Knowledge Graphs | Code Available | 0 |
| UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions | Apr 20, 2024 | Data Augmentation, Multiple-choice | Code Available | 0 |
| Solving Machine Learning Problems | Jul 2, 2021 | BIG-bench Machine Learning, Data Augmentation | Code Available | 0 |
| Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? | Jul 12, 2024 | Logical Reasoning, Multiple-choice | Code Available | 0 |
| Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Jan 10, 2025 | Automatic Speech Recognition, Classification | Code Available | 0 |