SOTAVerified

Benchmarking

Papers

Showing 61–70 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization | Code | 3 |
| REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites | Code | 3 |
| StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs | Code | 3 |
| SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining | Code | 3 |
| nnInteractive: Redefining 3D Promptable Segmentation | Code | 3 |
| Robust Latent Matters: Boosting Image Generation with Sampling Error | Code | 3 |
| OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection | Code | 3 |
| BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction | Code | 3 |
| ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks | Code | 3 |
| MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents | Code | 3 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |