SOTAVerified

Benchmarking

Papers

Showing 51–100 of 5548 papers

Title | Status | Hype
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning | Code | 2
QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges | | 0
Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping | Code | 0
Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey | | 0
Simulation-Based Sensitivity Analysis in Optimal Treatment Regimes and Causal Decomposition with Individualized Interventions | | 0
Staining normalization in histopathology: Method benchmarking using multicenter dataset | | 0
Survey of HPC in US Research Institutions | | 0
Benchmarking Music Generation Models and Metrics via Human Preference Studies | | 0
Identifiable Convex-Concave Regression via Sub-gradient Regularised Least Squares | | 0
Statistical Multicriteria Evaluation of LLM-Generated Text | Code | 0
On the Robustness of Human-Object Interaction Detection against Distribution Shift | | 0
TAB: Unified Benchmarking of Time Series Anomaly Detection Methods | Code | 2
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking | | 0
A Comparative Analysis of Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) as Dimensionality Reduction Techniques | | 0
Universal Music Representations? Evaluating Foundation Models on World Music Corpora | Code | 0
TabArena: A Living Benchmark for Machine Learning on Tabular Data | Code | 3
Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors | | 0
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Code | 1
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents | | 0
Finance Language Model Evaluation (FLaME) | | 0
BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models | Code | 2
Q2SAR: A Quantum Multiple Kernel Learning Approach for Drug Discovery | | 0
PGLib-CO2: A Power Grid Library for Computing and Optimizing Carbon Emissions | | 0
A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning | | 0
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies | Code | 1
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge | Code | 0
Egocentric Human-Object Interaction Detection: A New Benchmark and Method | | 0
The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products | Code | 1
C-TLSAN: Content-Enhanced Time-Aware Long- and Short-Term Attention Network for Personalized Recommendation | Code | 0
A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects | | 0
Deep Diffusion Models and Unsupervised Hyperspectral Unmixing for Realistic Abundance Map Synthesis | | 0
Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring | | 0
Robustness of Reinforcement Learning-Based Traffic Signal Control under Incidents: A Comparative Study | | 0
JENGA: Object selection and pose estimation for robotic grasping from a stack | | 0
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
A large-scale, physically-based synthetic dataset for satellite pose estimation | | 0
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios | Code | 0
OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics | Code | 4
Delving into Instance-Dependent Label Noise in Graph Data: A Comprehensive Study and Benchmark | Code | 0
ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications | Code | 3
Learning Best Paths in Quantum Networks | | 0
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables | | 0
SemanticST: Spatially Informed Semantic Graph Learning for Clustering, Integration, and Scalable Analysis of Spatial Transcriptomics | | 0
Temporal cross-validation impacts multivariate time series subsequence anomaly detection evaluation | | 0
crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023 | | 0
EconGym: A Scalable AI Testbed with Diverse Economic Tasks | | 0
Mind the XAI Gap: A Human-Centered LLM Framework for Democratizing Explainable AI | Code | 0
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks | Code | 2
HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation | | 0
Page 2 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified