SOTAVerified

Benchmarking

Papers

Showing 1551–1600 of 5548 papers

Title | Status | Hype
Statistical Multicriteria Evaluation of LLM-Generated Text | Code | 0
Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking | - | 0
A Comparative Analysis of Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) as Dimensionality Reduction Techniques | - | 0
Universal Music Representations? Evaluating Foundation Models on World Music Corpora | Code | 0
Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors | - | 0
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents | - | 0
Finance Language Model Evaluation (FLaME) | - | 0
PGLib-CO2: A Power Grid Library for Computing and Optimizing Carbon Emissions | - | 0
Q2SAR: A Quantum Multiple Kernel Learning Approach for Drug Discovery | - | 0
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge | Code | 0
A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning | - | 0
Egocentric Human-Object Interaction Detection: A New Benchmark and Method | - | 0
Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring | - | 0
JENGA: Object selection and pose estimation for robotic grasping from a stack | - | 0
A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects | - | 0
Deep Diffusion Models and Unsupervised Hyperspectral Unmixing for Realistic Abundance Map Synthesis | - | 0
C-TLSAN: Content-Enhanced Time-Aware Long- and Short-Term Attention Network for Personalized Recommendation | Code | 0
Robustness of Reinforcement Learning-Based Traffic Signal Control under Incidents: A Comparative Study | - | 0
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios | Code | 0
A large-scale, physically-based synthetic dataset for satellite pose estimation | - | 0
Learning Best Paths in Quantum Networks | - | 0
Delving into Instance-Dependent Label Noise in Graph Data: A Comprehensive Study and Benchmark | Code | 0
EconGym: A Scalable AI Testbed with Diverse Economic Tasks | - | 0
Temporal cross-validation impacts multivariate time series subsequence anomaly detection evaluation | - | 0
SemanticST: Spatially Informed Semantic Graph Learning for Clustering, Integration, and Scalable Analysis of Spatial Transcriptomics | - | 0
Mind the XAI Gap: A Human-Centered LLM Framework for Democratizing Explainable AI | Code | 0
crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023 | - | 0
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables | - | 0
OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics | - | 0
HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation | - | 0
Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning | - | 0
Sum Rate Maximization for Pinching Antennas Assisted RSMA System With Multiple Waveguides | - | 0
FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models | - | 0
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Code | 0
ScholarSearch: Benchmarking Scholar Searching Ability of LLMs | - | 0
ICE-ID: A Novel Historical Census Data Benchmark Comparing NARS against LLMs, & a ML Ensemble on Longitudinal Identity Resolution | - | 0
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents | - | 0
Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models | - | 0
GRAIL: A Benchmark for GRaph ActIve Learning in Dynamic Sensing Environments | - | 0
A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild | Code | 0
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens | - | 0
Graph Attention-based Decentralized Actor-Critic for Dual-Objective Control of Multi-UAV Swarms | - | 0
AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP | - | 0
Solving excited states for long-range interacting trapped ions with neural networks | - | 0
Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech | - | 0
Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework | - | 0
GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors | - | 0
Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting | - | 0
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning | Code | 0
Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified