| SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents | Jun 9, 2025 | Benchmarking, Synthetic Data Generation | Unverified | 0 |
| The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning | Jun 9, 2025 | Active Learning, Benchmarking | Code Available | 0 |
| HuSc3D: Human Sculpture dataset for 3D object reconstruction | Jun 9, 2025 | 3D Object Reconstruction, 3D Reconstruction | Code Available | 0 |
| EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | Jun 9, 2025 | Benchmarking, Navigate | Unverified | 0 |
| Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning | Jun 9, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding | Jun 9, 2025 | Benchmarking, Video Compression | Unverified | 0 |
| CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Jun 9, 2025 | Attribute, Benchmarking | Code Available | 0 |
| SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | Jun 9, 2025 | Action Classification, Benchmarking | Unverified | 0 |
| How Far Are We from Optimal Reasoning Efficiency? | Jun 8, 2025 | 16k, Benchmarking | Code Available | 0 |
| LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Jun 7, 2025 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 0 |