SOTAVerified

Benchmarking

Papers

Showing 1801–1825 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection | | 0 |
| On the Evaluation Consistency of Attribution-based Explanations | Code | 0 |
| OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation | Code | 1 |
| Benchmarking Dependence Measures to Prevent Shortcut Learning in Medical Imaging | Code | 0 |
| Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems | | 0 |
| VoxSim: A perceptual voice similarity dataset | Code | 1 |
| AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents | Code | 3 |
| ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Code | 1 |
| SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images | | 0 |
| GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy | | 0 |
| AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Code | 1 |
| Enhancing clinical decision support with physiological waveforms -- a multimodal benchmark in emergency care | Code | 1 |
| Building a Domain-specific Guardrail Model in Production | | 0 |
| Quality Assured: Rethinking Annotation Strategies in Imaging AI | | 0 |
| HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation | Code | 3 |
| MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning | Code | 2 |
| Flexible Generation of Preference Data for Recommendation Analysis | Code | 0 |
| Hi-EF: Benchmarking Emotion Forecasting in Human-interaction | Code | 0 |
| COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2 |
| Can time series forecasting be automated? A benchmark and analysis | | 0 |
| BONES: a Benchmark fOr Neural Estimation of Shapley values | Code | 0 |
| AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking | Code | 3 |
| Aggregated Attributions for Explanatory Analysis of 3D Segmentation Models | Code | 0 |
| InLUT3D: Challenging real indoor dataset for point cloud analysis | | 0 |
| Unlocking the Potential: Benchmarking Large Language Models in Water Engineering and Research | | 0 |
Page 73 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |