| InstructionBench: An Instructional Video Understanding Benchmark | Apr 7, 2025 | Common Sense Reasoning, Multiple-choice | Unverified | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | Apr 4, 2025 | Benchmarking, Management | Unverified | 0 |
| From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | Apr 4, 2025 | Multiple-choice | Unverified | 0 |
| VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence | Apr 3, 2025 | Multiple-choice | Code Available | 0 |
| ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning | Mar 31, 2025 | Multiple-choice | Unverified | 0 |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Mar 31, 2025 | Logical Reasoning, Multiple-choice | Code Available | 2 |
| Order Independence With Finetuning | Mar 30, 2025 | ARC, Language Modeling | Unverified | 0 |
| Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models | Mar 30, 2025 | Knowledge Graphs, Multiple-choice | Code Available | 0 |
| Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark | Mar 26, 2025 | MMLU, Multiple-choice | Code Available | 1 |
| Language Model Uncertainty Quantification with Attention Chain | Mar 24, 2025 | Computational Efficiency, Language Modeling | Code Available | 1 |
| Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering | Mar 23, 2025 | Benchmarking, Chart Question Answering | Unverified | 0 |
| Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark | Mar 22, 2025 | Multiple-choice | Unverified | 0 |
| SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia | Mar 21, 2025 | Multiple-choice | Unverified | 0 |
| Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | Mar 20, 2025 | Multiple-choice, Text Generation | Code Available | 0 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choice, Video Understanding | Code Available | 1 |
| AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models | Mar 20, 2025 | Autonomous Driving, Multiple-choice | Unverified | 0 |
| CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models | Mar 20, 2025 | Code Generation, Multiple-choice | Unverified | 0 |
| VisNumBench: Evaluating Number Sense of Multimodal Large Language Models | Mar 19, 2025 | Multiple-choice | Unverified | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| How much do LLMs learn from negative examples? | Mar 18, 2025 | Multiple-choice, Question Answering | Code Available | 0 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Mar 17, 2025 | Anatomy, Large Language Model | Code Available | 0 |
| It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education | Mar 13, 2025 | Multiple-choice | Unverified | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | Mar 13, 2025 | Large Language Model, Math | Unverified | 0 |
| The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory | Mar 13, 2025 | Math, Multiple-choice | Unverified | 0 |