SOTAVerified

Benchmarking

Papers

Showing 1801–1850 of 5548 papers

Title | Status | Hype
Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection | - | 0
On the Evaluation Consistency of Attribution-based Explanations | Code | 0
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation | Code | 1
Benchmarking Dependence Measures to Prevent Shortcut Learning in Medical Imaging | Code | 0
Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems | - | 0
VoxSim: A perceptual voice similarity dataset | Code | 1
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents | Code | 3
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Code | 1
SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images | - | 0
GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy | - | 0
Enhancing clinical decision support with physiological waveforms -- a multimodal benchmark in emergency care | Code | 1
AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Code | 1
Building a Domain-specific Guardrail Model in Production | - | 0
Quality Assured: Rethinking Annotation Strategies in Imaging AI | - | 0
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation | Code | 3
MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
Flexible Generation of Preference Data for Recommendation Analysis | Code | 0
Can time series forecasting be automated? A benchmark and analysis | - | 0
Hi-EF: Benchmarking Emotion Forecasting in Human-interaction | Code | 0
BONES: a Benchmark fOr Neural Estimation of Shapley values | Code | 0
AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking | Code | 3
Aggregated Attributions for Explanatory Analysis of 3D Segmentation Models | Code | 0
InLUT3D: Challenging real indoor dataset for point cloud analysis | - | 0
Unlocking the Potential: Benchmarking Large Language Models in Water Engineering and Research | - | 0
Benchmarks as Microscopes: A Call for Model Metrology | - | 0
Cascaded two-stage feature clustering and selection via separability and consistency in fuzzy decision systems | - | 0
LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies | Code | 1
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Code | 1
Open-CD: A Comprehensive Toolbox for Change Detection | - | 0
StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation | - | 0
Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA | Code | 0
Non-Reference Quality Assessment for Medical Imaging: Application to Synthetic Brain MRIs | - | 0
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding | Code | 1
Benchmarking deep learning models for bearing fault diagnosis using the CWRU dataset: A multi-label approach | - | 0
OCTrack: Benchmarking the Open-Corpus Multi-Object Tracking | - | 0
Realistic Evaluation of Test-Time Adaptation Algorithms: Unsupervised Hyperparameter Selection | - | 0
Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations | Code | 1
ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? | Code | 7
Vision-Based Power Line Cables and Pylons Detection for Low Flying Aircraft | - | 0
SHS: Scorpion Hunting Strategy Swarm Algorithm | - | 0
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance | - | 0
RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark | - | 0
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | - | 0
Restore Anything Model via Efficient Degradation Adaptation | Code | 1
Enhancing Biomedical Knowledge Discovery for Diseases: An Open-Source Framework Applied on Rett Syndrome and Alzheimer's Disease | Code | 0
Comprehensive Review and Empirical Evaluation of Causal Discovery Algorithms for Numerical Data | - | 0
Temporal receptive field in dynamic graph learning: A comprehensive analysis | Code | 0
Abstraction Alignment: Comparing Model-Learned and Human-Encoded Conceptual Relationships | Code | 0
Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models? | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified