SOTAVerified

Benchmarking

Papers

Showing 3451–3475 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection | | 0 |
| Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems | | 0 |
| MA-BBOB: A Problem Generator for Black-Box Optimization Using Affine Combinations and Shifts | | 0 |
| MA-BBOB: Many-Affine Combinations of BBOB Functions for Evaluating AutoML Approaches in Noiseless Numerical Black-Box Optimization Contexts | | 0 |
| Towards an AI Accountability Policy | | 0 |
| Machine Generated Product Advertisements: Benchmarking LLMs Against Human Performance | | 0 |
| Towards an Automated SOAP Note: Classifying Utterances from Medical Conversations | | 0 |
| A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video | | 0 |
| Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025) | | 0 |
| Towards a Taxonomy of Graph Learning Datasets | | 0 |
| Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices | | 0 |
| Machine learning for modelling unstructured grid data in computational physics: a review | | 0 |
| Towards a Theory-Guided Benchmarking Suite for Discrete Black-Box Optimization Heuristics: Profiling (1+λ) EA Variants on OneMax and LeadingOnes | | 0 |
| Machine Learning for Ranking f-wave Extraction Methods in Single-Lead ECGs | | 0 |
| Large Language Models for Classical Chinese Poetry Translation: Benchmarking, Evaluating, and Improving | | 0 |
| Uncertainty estimation of machine learning spatial precipitation predictions from satellite data | | 0 |
| Benchmarking LLMs for Mimicking Child-Caregiver Language in Interaction | | 0 |
| Benchmarking LLMs and SLMs for patient reported outcomes | | 0 |
| Benchmarking LLM powered Chatbots: Methods and Metrics | | 0 |
| Machine Vision based Sample-Tube Localization for Mars Sample Return | | 0 |
| Benchmarking LLM Guardrails in Handling Multilingual Toxicity | | 0 |
| Benchmarking LLM for Code Smells Detection: OpenAI GPT-4.0 vs DeepSeek-V3 | | 0 |
| Towards a Unified Framework for Determining Conformational Ensembles of Disordered Proteins | | 0 |
| Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios | | 0 |
| Making Sense of Data in the Wild: Data Analysis Automation at Scale | | 0 |
Page 139 of 222

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |