| LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | May 18, 2025 | Logical ReasoningMultimodal Reasoning | CodeCode Available | 1 |
| IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | May 16, 2025 | Multiple-choice | CodeCode Available | 1 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | May 16, 2025 | BenchmarkingDecision Making | —Unverified | 0 |
| ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training | May 16, 2025 | Multiple-choicetext-classification | —Unverified | 0 |
| GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | May 16, 2025 | Instruction FollowingMultiple-choice | CodeCode Available | 1 |
| Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | May 16, 2025 | Cross-Modal RetrievalDiagnostic | CodeCode Available | 2 |
| Ranked Voting based Self-Consistency of Large Language Models | May 16, 2025 | Multiple-choiceOpen-Ended Question Answering | CodeCode Available | 1 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | May 15, 2025 | InformativenessMultiple-choice | —Unverified | 0 |
| The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think | May 15, 2025 | Multiple-choice | —Unverified | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | May 14, 2025 | BenchmarkingMMLU | —Unverified | 0 |