SOTAVerified

Benchmarking

Papers

Showing 131-140 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting | | 0 |
| GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra | | 0 |
| CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Code | 0 |
| SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents | | 0 |
| EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | | 0 |
| Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning | Code | 0 |
| How Far Are We from Optimal Reasoning Efficiency? | Code | 0 |
| LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Code | 0 |
| BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures | | 0 |
| DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |