| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | May 21, 2025 | BenchmarkingMath | —Unverified | 0 |
| Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | May 21, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 0 |
| Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI | May 21, 2025 | Benchmarking | —Unverified | 0 |
| Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models | May 21, 2025 | BenchmarkingPrompt Engineering | —Unverified | 0 |
| VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | May 21, 2025 | BenchmarkingReinforcement Learning (RL) | —Unverified | 0 |
| Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | May 21, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 0 |
| Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking | May 21, 2025 | BenchmarkingDiagnostic | CodeCode Available | 0 |
| A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | May 21, 2025 | BenchmarkingDecompensation | —Unverified | 0 |
| Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | May 21, 2025 | BenchmarkingGPU | —Unverified | 0 |
| DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis | May 20, 2025 | BenchmarkingFairness | —Unverified | 0 |