| Benchmarking Laparoscopic Surgical Image Restoration and Beyond | May 25, 2025 | Benchmarking, Image Restoration | Code Available | 2 |
| Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | May 25, 2025 | Benchmarking, Multi-Agent Path Finding | Unverified | 0 |
| EnvSDD: Benchmarking Environmental Sound Deepfake Detection | May 25, 2025 | Audio Deepfake Detection, Audio Generation | Unverified | 0 |
| Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking | May 25, 2025 | Benchmarking, Chunking | Unverified | 0 |
| Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments | May 25, 2025 | Benchmarking | Unverified | 0 |
| Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset | May 24, 2025 | Benchmarking, RAG | Code Available | 0 |
| Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs | May 24, 2025 | Benchmarking | Unverified | 0 |
| From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation | May 24, 2025 | Articles, Benchmarking | Unverified | 0 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | May 24, 2025 | Benchmarking | Code Available | 2 |
| LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges | May 24, 2025 | Benchmarking, Mathematical Reasoning | Code Available | 0 |
| Benchmarking and Rethinking Knowledge Editing for Large Language Models | May 24, 2025 | Benchmarking, knowledge editing | Code Available | 0 |
| SPDEBench: An Extensive Benchmark for Learning Regular and Singular Stochastic PDEs | May 24, 2025 | Benchmarking | Code Available | 0 |
| SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models | May 24, 2025 | Benchmarking, Video Grounding | Unverified | 0 |
| ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation | May 24, 2025 | Benchmarking, Chart Understanding | Code Available | 3 |
| Benchmarking Poisoning Attacks against Retrieval-Augmented Generation | May 24, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection | May 24, 2025 | Benchmarking, Image Forgery Detection | Unverified | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | May 23, 2025 | Audio Generation, Benchmarking | Unverified | 0 |
| Benchmark for Antibody Binding Affinity Maturation and Design | May 23, 2025 | Benchmarking | Unverified | 0 |
| U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | May 23, 2025 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| 3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation | May 23, 2025 | 3D Face Reconstruction, Benchmarking | Code Available | 0 |
| A Position Paper on the Automatic Generation of Machine Learning Leaderboards | May 23, 2025 | Benchmarking, Position | Code Available | 0 |
| SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification | May 23, 2025 | Benchmarking, Classification | Code Available | 0 |
| PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints | May 23, 2025 | Benchmarking | Unverified | 0 |
| PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language | May 23, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | May 23, 2025 | Benchmarking, Code Generation | Code Available | 1 |