SOTAVerified

Benchmarking

Papers

Showing 51-75 of 5548 papers

Title | Status | Hype
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning | Code | 2
QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges | | 0
Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping | Code | 0
Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey | | 0
Simulation-Based Sensitivity Analysis in Optimal Treatment Regimes and Causal Decomposition with Individualized Interventions | | 0
Staining normalization in histopathology: Method benchmarking using multicenter dataset | | 0
Benchmarking Music Generation Models and Metrics via Human Preference Studies | | 0
Survey of HPC in US Research Institutions | | 0
Statistical Multicriteria Evaluation of LLM-Generated Text | Code | 0
Identifiable Convex-Concave Regression via Sub-gradient Regularised Least Squares | | 0
On the Robustness of Human-Object Interaction Detection against Distribution Shift | | 0
TAB: Unified Benchmarking of Time Series Anomaly Detection Methods | Code | 2
Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking | | 0
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
A Comparative Analysis of Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) as Dimensionality Reduction Techniques | | 0
Universal Music Representations? Evaluating Foundation Models on World Music Corpora | Code | 0
TabArena: A Living Benchmark for Machine Learning on Tabular Data | Code | 3
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Code | 1
Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors | | 0
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents | | 0
Finance Language Model Evaluation (FLaME) | | 0
BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models | Code | 2
PGLib-CO2: A Power Grid Library for Computing and Optimizing Carbon Emissions | | 0
Q2SAR: A Quantum Multiple Kernel Learning Approach for Drug Discovery | | 0
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified