| LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception | Apr 21, 2025 | Math, MMLU | —Unverified | 0 |
| Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Apr 20, 2025 | Autonomous Driving, Image Captioning | Code Available | 0 |
| FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models | Apr 20, 2025 | Descriptive, Ethics | —Unverified | 0 |
| Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment | Apr 19, 2025 | Classification, Multiple-choice | —Unverified | 0 |
| DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain | Apr 18, 2025 | Multiple-choice | —Unverified | 0 |
| D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model | Apr 18, 2025 | Distractor Generation, Multiple-choice | —Unverified | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | Apr 15, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 |
| AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | Apr 14, 2025 | Management, Multiple-choice | —Unverified | 0 |
| Large Language Models Could Be Rote Learners | Apr 11, 2025 | Memorization, MMLU | —Unverified | 0 |
| Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Apr 9, 2025 | Multiple-choice | Code Available | 0 |