Benchmarking

Papers

Showing 971–980 of 5548 papers

Title	Date	Tasks	Status	Hype
ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts	Feb 8, 2025	Benchmarking, Self-Supervised Learning	Code Available	1
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks	Feb 7, 2025	Benchmarking	Code Available	3
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks	Feb 7, 2025	Benchmarking, Multi-agent Reinforcement Learning	Code Available	1
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound	Feb 7, 2025	Benchmarking	Code Available	4
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models	Feb 6, 2025	Benchmarking, Emotional Intelligence	Unverified	0
Verifiable Format Control for Large Language Model Generations	Feb 6, 2025	Benchmarking, Instruction Following	Unverified	0
Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs	Feb 6, 2025	Benchmarking, Epidemiology	Code Available	0
Large Language Models for Multi-Robot Systems: A Survey	Feb 6, 2025	Action Generation, Benchmarking	Code Available	1
LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset	Feb 6, 2025	Benchmarking, Computed Tomography (CT)	Unverified	0
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization	Feb 6, 2025	Benchmarking, Uncertainty Quantification	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified