| AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science | May 25, 2025 | Benchmarking, Feature Engineering | Unverified | 0 |
| DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research | May 25, 2025 | Benchmarking, Information Retrieval | Unverified | 0 |
| SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | May 25, 2025 | Benchmarking, Visual Reasoning | Code Available | 1 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 |
| Benchmarking Laparoscopic Surgical Image Restoration and Beyond | May 25, 2025 | Benchmarking, Image Restoration | Code Available | 2 |
| SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs | May 25, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | May 25, 2025 | Benchmarking, Multi-Agent Path Finding | Unverified | 0 |
| EnvSDD: Benchmarking Environmental Sound Deepfake Detection | May 25, 2025 | Audio Deepfake Detection, Audio Generation | Unverified | 0 |
| Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking | May 25, 2025 | Benchmarking, Chunking | Unverified | 0 |
| Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments | May 25, 2025 | Benchmarking | Unverified | 0 |