Benchmarking

Papers

Showing 1–10 of 5548 papers

Title	Date	Tasks	Status	Hype
WebWalker: Benchmarking LLMs in Web Traversal	Jan 13, 2025	Benchmarking, Open-Domain Question Answering	Code Available	11
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models	Mar 12, 2024	Benchmarking	Code Available	9
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning	Jan 25, 2025	Benchmarking, Evolutionary Algorithms	Code Available	7
CALE: Continuous Arcade Learning Environment	Oct 31, 2024	Atari Games, Benchmarking	Code Available	7
Segment Anything in Medical Images and Videos: Benchmark and Deployment	Aug 6, 2024	Benchmarking, Segmentation	Code Available	7
ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?	Jul 19, 2024	Benchmarking, Code Generation	Code Available	7
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking	Jun 21, 2024	Autonomous Driving, Benchmarking	Code Available	7
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments	Apr 11, 2024	Benchmarking	Code Available	7
Better than classical? The subtle art of benchmarking quantum machine learning models	Mar 11, 2024	Benchmarking, Binary Classification	Code Available	7
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models	Feb 8, 2024	Benchmarking, Diversity	Code Available	7

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified