SOTAVerified

Benchmarking

Papers

Showing 1-10 of 5548 papers

Title | Status | Hype
WebWalker: Benchmarking LLMs in Web Traversal | Code | 11
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models | Code | 9
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning | Code | 7
CALE: Continuous Arcade Learning Environment | Code | 7
Segment Anything in Medical Images and Videos: Benchmark and Deployment | Code | 7
ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? | Code | 7
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | Code | 7
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Code | 7
Better than classical? The subtle art of benchmarking quantum machine learning models | Code | 7
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Code | 7

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified