SOTAVerified

Benchmarking

Papers

Showing 2141-2150 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Code | 2 |
| Scaffold Splits Overestimate Virtual Screening Performance | | 0 |
| WebSuite: Systematically Evaluating Why Web Agents Fail | Code | 0 |
| GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Code | 1 |
| On the project risk baseline: integrating aleatory uncertainty into project scheduling | | 0 |
| LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | Code | 1 |
| SECURE: Benchmarking Large Language Models for Cybersecurity | Code | 1 |
| Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images | | 0 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1 |
| CoSy: Evaluating Textual Explanations of Neurons | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |