| GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation | May 17, 2025 | Benchmarking | Unverified | 0 |
| SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | May 17, 2025 | Benchmarking, Binary Classification | Code Available | 0 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics | Code Available | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | May 16, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems | May 16, 2025 | Benchmarking, Mixture-of-Experts | Unverified | 0 |
| GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents | May 16, 2025 | Benchmarking, Instruction Following | Unverified | 0 |
| Benchmarking CFAR and CNN-based Peak Detection Algorithms in ISAC under Hardware Impairments | May 16, 2025 | Benchmarking, Integrated sensing and communication | Unverified | 0 |
| Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models | May 16, 2025 | Benchmarking | Unverified | 0 |
| Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese | May 16, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks | May 16, 2025 | Benchmarking, Link Prediction | Code Available | 0 |