| LLM-based Text Simplification and its Effect on User Comprehension and Cognitive Load | May 4, 2025 | Articles, Multiple-choice | Unverified | 0 |
| LookAlike: Consistent Distractor Generation in Math MCQs | May 3, 2025 | Distractor Generation, Math | Unverified | 0 |
| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | May 2, 2025 | Multiple-choice | Unverified | 0 |
| Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors | May 2, 2025 | High School Physics, Misconceptions | Code Available | 0 |
| SARI: Structured Audio Reasoning via Curriculum-Guided Reinforcement Learning | Apr 22, 2025 | Multiple-choice, reinforcement-learning | Unverified | 0 |
| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | Apr 21, 2025 | Math, MMLU | Unverified | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | Apr 20, 2025 | Descriptive, Ethics | Unverified | 0 |
| Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment | Apr 19, 2025 | Classification, Multiple-choice | Unverified | 0 |
| DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain | Apr 18, 2025 | Multiple-choice | Unverified | 0 |
| D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model | Apr 18, 2025 | Distractor Generation, Multiple-choice | Unverified | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | Apr 15, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | Apr 14, 2025 | Management, Multiple-choice | Unverified | 0 |
| Large Language Models Could Be Rote Learners | Apr 11, 2025 | Memorization, MMLU | Unverified | 0 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Apr 9, 2025 | Multiple-choice | Code Available | 0 |
| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | Unverified | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | Apr 4, 2025 | Benchmarking, Management | Unverified | 0 |
| From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | Apr 4, 2025 | Multiple-choice | Unverified | 0 |
| VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence | Apr 3, 2025 | Multiple-choice | Code Available | 0 |
| ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning | Mar 31, 2025 | Multiple-choice | Unverified | 0 |
| Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models | Mar 30, 2025 | Knowledge Graphs, Multiple-choice | Code Available | 0 |
| Order Independence With Finetuning | Mar 30, 2025 | ARC, Language Modeling | Unverified | 0 |
| Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering | Mar 23, 2025 | Benchmarking, Chart Question Answering | Unverified | 0 |
| Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark | Mar 22, 2025 | Multiple-choice | Unverified | 0 |
| SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia | Mar 21, 2025 | Multiple-choice | Unverified | 0 |