SOTAVerified|Agents Browse Leaderboard About Blog

Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1601–1610 of 5548 papers

Title	Date	Tasks	Status	Hype
SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents	Jun 9, 2025	BenchmarkingSynthetic Data Generation	—Unverified	0
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning	Jun 9, 2025	Active LearningBenchmarking	CodeCode Available	0
HuSc3D: Human Sculpture dataset for 3D object reconstruction	Jun 9, 2025	3D Object Reconstruction3D Reconstruction	CodeCode Available	0
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments	Jun 9, 2025	BenchmarkingNavigate	—Unverified	0
Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning	Jun 9, 2025	BenchmarkingDiagnostic	CodeCode Available	0
Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding	Jun 9, 2025	BenchmarkingVideo Compression	—Unverified	0
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems	Jun 9, 2025	AttributeBenchmarking	CodeCode Available	0
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis	Jun 9, 2025	Action ClassificationBenchmarking	—Unverified	0
How Far Are We from Optimal Reasoning Efficiency?	Jun 8, 2025	16kBenchmarking	CodeCode Available	0
LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping	Jun 7, 2025	BenchmarkingSimultaneous Localization and Mapping	CodeCode Available	0

Show:10 25 50

← PrevPage 161 of 555Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified