SOTAVerified

Benchmarking

Papers

Showing 971-980 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Code | 1 |
| ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks | Code | 3 |
| An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1 |
| Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | Code | 4 |
| EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models | | 0 |
| Verifiable Format Control for Large Language Model Generations | | 0 |
| Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs | Code | 0 |
| Large Language Models for Multi-Robot Systems: A Survey | Code | 1 |
| LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset | | 0 |
| Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |