SOTAVerified

Benchmarking

Papers

Showing 151 to 160 of 5548 papers

Title | Status | Hype
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | Code | 2
VERINA: Benchmarking Verifiable Code Generation | Code | 2
LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | Code | 2
Benchmarking Laparoscopic Surgical Image Restoration and Beyond | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | Code | 2
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Code | 2
Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement | Code | 2
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2
MINERVA: Evaluating Complex Video Reasoning | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified