SOTAVerified

Benchmarking

Papers

Showing 251–260 of 5548 papers

Title | Status | Hype
HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Code | 2
How far are today's time-series models from real-world weather forecasting applications? | Code | 2
A large-scale multicenter breast cancer DCE-MRI benchmark dataset with expert segmentations | Code | 2
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Code | 2
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Code | 2
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models | Code | 2
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models | Code | 2
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Code | 2
BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics | Code | 2
StreamBench: Towards Benchmarking Continuous Improvement of Language Agents | Code | 2
Page 26 of 555

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified