| Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights | Jun 5, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms | Jun 5, 2025 | Multiple-choice | Unverified | 0 |
| Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales | Jun 4, 2025 | Multiple-choice | Unverified | 0 |
| Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis | Jun 3, 2025 | Multiple-choice | Unverified | 0 |
| Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation | Jun 2, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases | May 30, 2025 | Medical Question Answering, Multiple-choice | Unverified | 0 |
| Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization | May 30, 2025 | Form, Language Modeling | Unverified | 0 |
| Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | May 30, 2025 | Continual Pretraining, Fairness | Code Available | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | May 30, 2025 | Domain Generalization, Multiple-choice | Unverified | 0 |
| Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | May 30, 2025 | Math, Multiple-choice | Code Available | 0 |
| PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain | May 30, 2025 | Instruction Following, Multiple-choice | Unverified | 0 |
| MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | May 29, 2025 | Multiple-choice, Spatial Reasoning | Unverified | 0 |
| TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine | May 29, 2025 | Diagnostic, Multiple-choice | Unverified | 0 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | May 29, 2025 | MMLU, Multiple-choice | Code Available | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | Benchmarking, Information Retrieval | Code Available | 0 |
| Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs | May 29, 2025 | Image Generation, Multiple-choice | Unverified | 0 |
| Large Language Models Often Know When They Are Being Evaluated | May 28, 2025 | MMLU, Multiple-choice | Unverified | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | May 27, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions | May 26, 2025 | Multiple-choice | Unverified | 0 |
| My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals | May 26, 2025 | Ethics, Fairness | Unverified | 0 |
| DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response | May 26, 2025 | Multiple-choice | Unverified | 0 |
| CP-Router: An Uncertainty-Aware Router Between LLM and LRM | May 26, 2025 | Conformal Prediction, Logical Reasoning | Unverified | 0 |
| BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | May 25, 2025 | General Knowledge, MMLU | Code Available | 0 |
| Enhancing LLMs' Reasoning-Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning | May 24, 2025 | Multiple-choice, Prompt Engineering | Unverified | 0 |
| Collaboration among Multiple Large Language Models for Medical Question Answering | May 22, 2025 | Medical Question Answering, Multiple-choice | Unverified | 0 |