Benchmarking

Papers

Showing 151–160 of 5548 papers

Title	Date	Tasks	Status	Hype
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents	May 30, 2025	Benchmarking, Blocking	Code Available	2
VERINA: Benchmarking Verifiable Code Generation	May 29, 2025	Benchmarking, Code Generation	Code Available	2
LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms	May 27, 2025	Bayesian Optimization, Benchmarking	Code Available	2
Benchmarking Laparoscopic Surgical Image Restoration and Beyond	May 25, 2025	Benchmarking, Image Restoration	Code Available	2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions	May 24, 2025	Benchmarking	Code Available	2
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification	May 18, 2025	Benchmarking	Code Available	2
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly	May 15, 2025	8k, Benchmarking	Code Available	2
Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement	May 13, 2025	Benchmarking, Language Modeling	Code Available	2
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models	May 5, 2025	Benchmarking, Mathematical Reasoning	Code Available	2
MINERVA: Evaluating Complex Video Reasoning	May 1, 2025	Benchmarking, Temporal Localization	Code Available	2

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified