| An MRC Framework for Semantic Role Labeling | Jan 16, 2022 | Computational Efficiency, Machine Reading Comprehension | —Unverified | 0 | 0 |
| BloomVQA: Assessing Hierarchical Multi-modal Comprehension | Dec 20, 2023 | Data Augmentation, Memorization | —Unverified | 0 | 0 |
| The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs | Feb 6, 2025 | Multiple-choice, Sensitivity | —Unverified | 0 | 0 |
| The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines? | Sep 3, 2024 | Multiple-choice, Question Generation | —Unverified | 0 | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | Feb 12, 2025 | Multiple-choice, Survey | —Unverified | 0 | 0 |
| The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts | Feb 3, 2025 | Multiple-choice, Reading Comprehension | —Unverified | 0 | 0 |
| Bridging Information-Seeking Human Gaze and Machine Reading Comprehension | Sep 30, 2020 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| Bridging the Language Gap: Knowledge Injected Multilingual Question Answering | Apr 6, 2023 | Cross-Lingual Transfer, Extractive Question-Answering | —Unverified | 0 | 0 |
| Analysis of the Cambridge Multiple-Choice Questions Reading Dataset with a Focus on Candidate Response Distribution | Jun 22, 2023 | Multiple-choice | —Unverified | 0 | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | Apr 4, 2025 | Benchmarking, Management | —Unverified | 0 | 0 |
| Can ChatGPT pass the Vietnamese National High School Graduation Examination? | Jun 15, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Can Crowdsourcing be used for Effective Annotation of Arabic? | May 1, 2014 | Entity Resolution, Multiple-choice | —Unverified | 0 | 0 |
| Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses? | Mar 16, 2023 | Multiple-choice | —Unverified | 0 | 0 |
| The use of large language models to enhance cancer clinical trial educational materials | Dec 2, 2024 | Misinformation, Multiple-choice | —Unverified | 0 | 0 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | Jan 18, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer | May 27, 2024 | Multiple-choice, Sentiment Analysis | —Unverified | 0 | 0 |
| CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy | Oct 17, 2024 | Multiple-choice, Response Generation | —Unverified | 0 | 0 |
| ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising | Dec 9, 2024 | Multiple-choice, Multi-Task Learning | —Unverified | 0 | 0 |
| CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models | Jul 2, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | Dec 16, 2024 | Hallucination, Multiple-choice | —Unverified | 0 | 0 |
| Changing Answer Order Can Decrease MMLU Accuracy | Jun 27, 2024 | MMLU, Multiple-choice | —Unverified | 0 | 0 |
| Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks | Nov 9, 2023 | Multiple-choice, World Knowledge | —Unverified | 0 | 0 |
| What Makes Reading Comprehension Questions Difficult? Investigating Variation in Passage Sources and Question Types | Sep 17, 2021 | Logical Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | Mar 13, 2025 | Large Language Model, Math | —Unverified | 0 | 0 |
| An Improved Traditional Chinese Evaluation Suite for Foundation Model | Mar 4, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |