| DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection | Jun 6, 2025 | BenchmarkingDeepFake Detection | —Unverified | 0 |
| Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques | Jun 6, 2025 | BenchmarkingModel Selection | —Unverified | 0 |
| FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Jun 6, 2025 | Benchmarking | CodeCode Available | 1 |
| Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions | Jun 6, 2025 | BenchmarkingState Space Models | —Unverified | 0 |
| MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | Jun 6, 2025 | Benchmarking | CodeCode Available | 0 |
| MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Jun 5, 2025 | Benchmarking | CodeCode Available | 1 |
| Benchmarking Large Language Models on Homework Assessment in Circuit Analysis | Jun 5, 2025 | Benchmarking | —Unverified | 0 |
| EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | Jun 5, 2025 | BenchmarkingEmotion Recognition | —Unverified | 0 |
| Refer to Anything with Vision-Language Prompts | Jun 5, 2025 | BenchmarkingGeneralized Referring Expression Segmentation | —Unverified | 0 |
| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | Jun 5, 2025 | BenchmarkingDiversity | —Unverified | 0 |