| Improving the Production Efficiency and Well-formedness of Automatically-Generated Multiple-Choice Cloze Vocabulary Questions | May 1, 2020 | Multiple-choice | —Unverified | 0 |
| In Case You Missed It: ARC 'Challenge' Is Not That Challenging | Dec 23, 2024 | ARC, Multiple-choice | —Unverified | 0 |
| TVBench: Redesigning Video-Language Evaluation | Oct 10, 2024 | Multiple-choice, Open-Ended Question Answering | —Unverified | 0 |
| Indirect Identification of Psychosocial Risks from Natural Language | Apr 30, 2020 | Multiple-choice, Topic Models | —Unverified | 0 |
| Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection | Jan 28, 2025 | Multiple-choice | —Unverified | 0 |
| Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions | Oct 19, 2022 | Language Modeling, Language Modelling | —Unverified | 0 |
| InnerThoughts: Disentangling Representations and Predictions in Large Language Models | Jan 29, 2025 | Multiple-choice, Position | —Unverified | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 |
| Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | Jun 13, 2025 | Medical Question Answering, MedQA | —Unverified | 0 |
| Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | Feb 19, 2025 | Instruction Following, Multiple-choice | —Unverified | 0 |