| The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory | Mar 13, 2025 | Math, Multiple-choice | —Unverified | 0 | 0 |
| A Novel Approach for Constrained Optimization in Graphical Models | Dec 1, 2020 | Multiple-choice | —Unverified | 0 | 0 |
| BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles | Sep 23, 2021 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | Feb 23, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | Apr 18, 2024 | Depth Estimation, Multiple-choice | —Unverified | 0 | 0 |
| An MRC Framework for Semantic Role Labeling | Jan 16, 2022 | Computational Efficiency, Machine Reading Comprehension | —Unverified | 0 | 0 |
| BloomVQA: Assessing Hierarchical Multi-modal Comprehension | Dec 20, 2023 | Data Augmentation, Memorization | —Unverified | 0 | 0 |
| The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs | Feb 6, 2025 | Multiple-choice, Sensitivity | —Unverified | 0 | 0 |
| The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines? | Sep 3, 2024 | Multiple-choice, Question Generation | —Unverified | 0 | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | Feb 12, 2025 | Multiple-choice, Survey | —Unverified | 0 | 0 |