| So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection | May 24, 2025 | Benchmarking, Image Forgery Detection | Unverified | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | May 23, 2025 | Audio Generation, Benchmarking | Unverified | 0 |
| Benchmark for Antibody Binding Affinity Maturation and Design | May 23, 2025 | Benchmarking | Unverified | 0 |
| U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | May 23, 2025 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| 3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation | May 23, 2025 | 3D Face Reconstruction, Benchmarking | Code Available | 0 |
| A Position Paper on the Automatic Generation of Machine Learning Leaderboards | May 23, 2025 | Benchmarking, Position | Code Available | 0 |
| SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification | May 23, 2025 | Benchmarking, Classification | Code Available | 0 |
| PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints | May 23, 2025 | Benchmarking | Unverified | 0 |
| PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language | May 23, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | May 23, 2025 | Benchmarking, Code Generation | Code Available | 1 |