| SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models | May 24, 2025 | Benchmarking, Video Grounding | Unverified | 0 |
| A Position Paper on the Automatic Generation of Machine Learning Leaderboards | May 23, 2025 | Benchmarking, Position | Code Available | 0 |
| PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language | May 23, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | May 23, 2025 | Benchmarking, Diversity | Code Available | 0 |
| PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints | May 23, 2025 | Benchmarking | Unverified | 0 |
| Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts | May 23, 2025 | Benchmarking | Unverified | 0 |
| U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | May 23, 2025 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| Benchmark for Antibody Binding Affinity Maturation and Design | May 23, 2025 | Benchmarking | Unverified | 0 |
| SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification | May 23, 2025 | Benchmarking, Classification | Code Available | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | May 23, 2025 | Audio Generation, Benchmarking | Unverified | 0 |