SOTAVerified

Benchmarking

Papers

Showing 1601-1610 of 5548 papers

Title | Status | Hype
SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents |  | 0
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning | Code | 0
HuSc3D: Human Sculpture dataset for 3D object reconstruction | Code | 0
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments |  | 0
Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning | Code | 0
Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding |  | 0
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Code | 0
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis |  | 0
How Far Are We from Optimal Reasoning Efficiency? | Code | 0
LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 |  | Unverified