SOTAVerified

Benchmarking

Papers

Showing 641-650 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models | Code | 1 |
| Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond | Code | 1 |
| OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents | Code | 1 |
| Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Code | 1 |
| ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Code | 1 |
| VoxSim: A perceptual voice similarity dataset | Code | 1 |
| OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation | Code | 1 |
| Enhancing clinical decision support with physiological waveforms -- a multimodal benchmark in emergency care | Code | 1 |
| AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Code | 1 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |