| The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations | Jul 17, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models | Jul 17, 2025 | Multiple-choice | Unverified | 0 |
| MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks | Jul 3, 2025 | Fairness, Multiple-choice | Unverified | 0 |
| Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III | Jun 29, 2025 | Model Selection, Multiple-choice | Unverified | 0 |
| OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs | Jun 26, 2025 | Diversity, Multiple-choice | Unverified | 0 |
| Adapting Vision-Language Models for Evaluating World Models | Jun 22, 2025 | Action Recognition, Multimodal Reasoning | Unverified | 0 |
| PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models | Jun 21, 2025 | Mathematical Reasoning, Multiple-choice | Unverified | 0 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | Jun 18, 2025 | Document Understanding, Multiple-choice | Unverified | 0 |
| Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings | Jun 17, 2025 | Decision Making, Language Modeling | Unverified | 0 |
| Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding | Jun 17, 2025 | Multiple-choice, Natural Language Inference | Unverified | 0 |
| Training-free LLM Merging for Multi-task Learning | Jun 14, 2025 | Multiple-choice, Multi-Task Learning | Code Available | 0 |
| Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | Jun 13, 2025 | Medical Question Answering, MedQA | Unverified | 0 |
| Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs | Jun 12, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs | Jun 11, 2025 | Multiple-choice | Unverified | 0 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Jun 10, 2025 | Multiple-choice, Open-Ended Question Answering | Unverified | 0 |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | Jun 9, 2025 | Descriptive, Form | Unverified | 0 |
| Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | Jun 8, 2025 | Multiple-choice | Unverified | 0 |
| STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Jun 6, 2025 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights | Jun 5, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms | Jun 5, 2025 | Multiple-choice | Unverified | 0 |
| LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Jun 4, 2025 | Multiple-choice | Code Available | 1 |
| Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales | Jun 4, 2025 | Multiple-choice | Unverified | 0 |
| Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis | Jun 3, 2025 | Multiple-choice | Unverified | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | Jun 2, 2025 | Multiple-choice, Question Answering | Unverified | 0 |