Benchmarking

Papers

Showing 61–70 of 5548 papers

Title	Date	Tasks	Status	Hype
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization	May 9, 2025	Benchmarking	Code Available	3
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites	Apr 15, 2025	Autonomous Web Navigation, Benchmarking	Code Available	3
StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs	Mar 26, 2025	Benchmarking	Code Available	3
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining	Mar 23, 2025	3DGS, Benchmarking	Code Available	3
nnInteractive: Redefining 3D Promptable Segmentation	Mar 11, 2025	Benchmarking, Interactive Segmentation	Code Available	3
Robust Latent Matters: Boosting Image Generation with Sampling Error	Mar 11, 2025	Benchmarking, Image Generation	Code Available	3
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection	Feb 27, 2025	Action Detection, Benchmarking	Code Available	3
BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction	Feb 26, 2025	Benchmarking, Time Series	Code Available	3
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks	Feb 7, 2025	Benchmarking	Code Available	3
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents	Jan 24, 2025	Benchmarking	Code Available	3

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified