| Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding | Jun 17, 2025 | Multiple-choice, Natural Language Inference | Unverified | 0 |
| Training-free LLM Merging for Multi-task Learning | Jun 14, 2025 | Multiple-choice, Multi-Task Learning | Code Available | 0 |
| Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | Jun 13, 2025 | Medical Question Answering, MedQA | Unverified | 0 |
| Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs | Jun 12, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs | Jun 11, 2025 | Multiple-choice | Unverified | 0 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Jun 10, 2025 | Multiple-choice, Open-Ended Question Answering | Unverified | 0 |
| ARGUS: Hallucination and Omission Evaluation in Video-LLMs | Jun 9, 2025 | Descriptive, Form | Unverified | 0 |
| Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | Jun 8, 2025 | Multiple-choice | Unverified | 0 |
| STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Jun 6, 2025 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights | Jun 5, 2025 | Multiple-choice, Question Answering | Unverified | 0 |