SOTAVerified

Benchmarking

Papers

Showing 71–80 of 5548 papers

Title	Date	Tasks	Status	Hype
Advancing LLM Reasoning Generalists with Preference Trees	Apr 2, 2024	Benchmarking, Code Generation	Code Available	3
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models	May 22, 2025	Benchmarking, Instruction Following	Code Available	3
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning	Jun 5, 2023	Benchmarking	Code Available	3
MLVU: Benchmarking Multi-task Long Video Understanding	Jun 6, 2024	Benchmarking, Video Understanding	Code Available	3
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving	May 27, 2024	Autonomous Driving, Benchmarking	Code Available	3
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems	Sep 2, 2024	Benchmarking, Instruction Following	Code Available	3
AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking	Jul 23, 2024	Benchmarking, Transfer Learning	Code Available	3
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis	Oct 9, 2023	Benchmarking, Multivariate Time Series Forecasting	Code Available	3
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation	Jun 19, 2024	Benchmarking, Image Generation	Code Available	3
BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction	Feb 26, 2025	Benchmarking, Time Series	Code Available	3

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified