| Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs | Jun 12, 2025 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs | Jun 11, 2025 | Multiple-choice | —Unverified | 0 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Jun 10, 2025 | Multiple-choiceOpen-Ended Question Answering | —Unverified | 0 |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | Jun 9, 2025 | DescriptiveForm | —Unverified | 0 |
| Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | Jun 8, 2025 | Multiple-choice | —Unverified | 0 |
| Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms | Jun 5, 2025 | Multiple-choice | —Unverified | 0 |
| Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights | Jun 5, 2025 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales | Jun 4, 2025 | Multiple-choice | —Unverified | 0 |
| Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis | Jun 3, 2025 | Multiple-choice | —Unverified | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | Jun 2, 2025 | Multiple-choiceQuestion Answering | —Unverified | 0 |