| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | May 21, 2025 | Benchmarking, Math | Unverified | 0 |
| A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | May 21, 2025 | Benchmarking, Decompensation | Unverified | 0 |
| Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | May 21, 2025 | Benchmarking, GPU | Unverified | 0 |
| NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | May 21, 2025 | Benchmarking, Hallucination | Unverified | 0 |
| NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation | May 20, 2025 | Autonomous Navigation, Benchmarking | Unverified | 0 |
| ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | May 20, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking data encoding methods in Quantum Machine Learning | May 20, 2025 | Benchmarking, Quantum Machine Learning | Unverified | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | May 20, 2025 | Benchmarking | Unverified | 0 |
| DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis | May 20, 2025 | Benchmarking, Fairness | Unverified | 0 |
| Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach | May 20, 2025 | Benchmarking | Unverified | 0 |
| TransBench: Benchmarking Machine Translation for Industrial-Scale Applications | May 20, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| A Data-Driven Method to Identify IBRs with Dominant Participation in Sub-Synchronous Oscillations | May 20, 2025 | Benchmarking | Unverified | 0 |
| SlangDIT: Benchmarking LLMs in Interpretative Slang Translation | May 20, 2025 | Benchmarking, Sentence | Unverified | 0 |
| LLM-based Evaluation Policy Extraction for Ecological Modeling | May 20, 2025 | Benchmarking, Large Language Model | Unverified | 0 |
| NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | May 20, 2025 | Anomaly Localization, Benchmarking | Unverified | 0 |
| SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis | May 20, 2025 | Benchmarking, Model Optimization | Code Available | 0 |
| SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas | May 20, 2025 | Benchmarking, Logical Reasoning | Unverified | 0 |
| Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | May 19, 2025 | Benchmarking | Code Available | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | May 19, 2025 | Benchmarking, Legal Reasoning | Unverified | 0 |
| CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | May 19, 2025 | Benchmarking, Red Teaming | Unverified | 0 |
| Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings | May 19, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference | May 19, 2025 | Benchmarking, Causal Inference | Unverified | 0 |
| SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference | May 19, 2025 | Benchmarking, EEG | Unverified | 0 |
| Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning | May 19, 2025 | Benchmarking | Unverified | 0 |
| A Comprehensive Benchmarking Platform for Deep Generative Models in Molecular Design | May 19, 2025 | Benchmarking, Drug Discovery | Unverified | 0 |