| Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models | May 30, 2025 | MathMultiple-choice | CodeCode Available | 0 |
| Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation | May 30, 2025 | Continual PretrainingFairness | CodeCode Available | 0 |
| TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine | May 29, 2025 | DiagnosticMultiple-choice | —Unverified | 0 |
| DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors | May 29, 2025 | MMLUMultiple-choice | CodeCode Available | 0 |
| MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | May 29, 2025 | Multiple-choiceSpatial Reasoning | —Unverified | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | BenchmarkingInformation Retrieval | CodeCode Available | 0 |
| VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | May 29, 2025 | Anomaly DetectionDescriptive | CodeCode Available | 2 |
| Image Aesthetic Reasoning: A New Benchmark for Medical Image Screening with MLLMs | May 29, 2025 | Image GenerationMultiple-choice | —Unverified | 0 |
| Large Language Models Often Know When They Are Being Evaluated | May 28, 2025 | MMLUMultiple-choice | —Unverified | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | May 27, 2025 | BenchmarkingMultiple-choice | —Unverified | 0 |