| A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge | May 8, 2025 | Benchmarking | CodeCode Available | 0 |
| Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents | May 8, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection | May 8, 2025 | BenchmarkingOut-of-Distribution Generalization | —Unverified | 0 |
| DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | May 8, 2025 | Autonomous NavigationBenchmarking | CodeCode Available | 0 |
| Benchmarking Traditional Machine Learning and Deep Learning Models for Fault Detection in Power Transformers | May 7, 2025 | BenchmarkingFault Detection | CodeCode Available | 0 |
| Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | May 7, 2025 | BenchmarkingSemantic Segmentation | CodeCode Available | 0 |
| False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims | May 7, 2025 | Benchmarking | CodeCode Available | 0 |
| Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | May 7, 2025 | BenchmarkingHallucination | CodeCode Available | 1 |
| Advancing and Benchmarking Personalized Tool Invocation for LLMs | May 7, 2025 | BenchmarkingWorld Knowledge | CodeCode Available | 0 |
| Benchmarking LLMs' Swarm intelligence | May 7, 2025 | Benchmarking | CodeCode Available | 1 |