| VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | May 21, 2025 | Benchmarking | Code Available | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | May 21, 2025 | Benchmarking, Imitation Learning | —Unverified | 0 |
| Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | May 21, 2025 | Benchmarking, GPU | —Unverified | 0 |
| AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | May 21, 2025 | Benchmarking, Chatbot | —Unverified | 0 |
| VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | May 21, 2025 | Benchmarking, Reinforcement Learning (RL) | —Unverified | 0 |
| InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | May 21, 2025 | Benchmarking, RAG | —Unverified | 0 |
| Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | May 21, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | May 21, 2025 | Benchmarking, Decompensation | —Unverified | 0 |
| Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking | May 21, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | May 20, 2025 | Benchmarking | —Unverified | 0 |