| Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting | Jun 9, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra | Jun 9, 2025 | 3D Reconstruction, Benchmarking | Unverified | 0 |
| CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Jun 9, 2025 | Attribute, Benchmarking | Code Available | 0 |
| SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents | Jun 9, 2025 | Benchmarking, Synthetic Data Generation | Unverified | 0 |
| EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | Jun 9, 2025 | Benchmarking, Navigate | Unverified | 0 |
| Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning | Jun 9, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| How Far Are We from Optimal Reasoning Efficiency? | Jun 8, 2025 | 16k, Benchmarking | Code Available | 0 |
| LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Jun 7, 2025 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 0 |
| BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures | Jun 6, 2025 | Benchmarking, CPU | Unverified | 0 |
| DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection | Jun 6, 2025 | Benchmarking, DeepFake Detection | Unverified | 0 |