| It's Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning | Nov 13, 2023 | Multiple-choice | CodeCode Available | 0 | 5 |
| Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales | Oct 2, 2024 | Multiple-choice | CodeCode Available | 0 | 5 |
| Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models | Apr 11, 2024 | Multiple-choiceReading Comprehension | CodeCode Available | 0 | 5 |
| Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning | Feb 8, 2025 | Legal ReasoningMultiple-choice | CodeCode Available | 0 | 5 |
| DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence? | Jun 18, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 0 | 5 |
| Introducing a framework to assess newly created questions with Natural Language Processing | Apr 28, 2020 | Multiple-choice | CodeCode Available | 0 | 5 |
| IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models | Jun 18, 2024 | ManagementMultiple-choice | CodeCode Available | 0 | 5 |
| Self-Recognition in Language Models | Jul 9, 2024 | Multiple-choice | CodeCode Available | 0 | 5 |
| LLaVA-OneVision: Easy Visual Task Transfer | Aug 6, 2024 | 3D Question Answering (3D-QA) | CodeCode Available | 0 | 5 |
| Improving Question Answering with External Knowledge | Feb 3, 2019 | ARCMultiple-choice | CodeCode Available | 0 | 5 |