SOTAVerified

Benchmarking

Papers

Showing 5051-5100 of 5548 papers

Title | Status | Hype
Efficient and Accurate Optimal Transport with Mirror Descent and Conjugate Gradients | Code | 0
SimbaML: Connecting Mechanistic Models and Machine Learning with Augmented Data | Code | 0
NSINA: A News Corpus for Sinhala | Code | 0
Improving Sequential Recommendation Models with an Enhanced Loss Function | Code | 0
Aspect-based Sentiment Classification with Aspect-specific Graph Convolutional Networks | Code | 0
Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models | Code | 0
SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins | Code | 0
A Seq2Seq approach to Symbolic Regression | Code | 0
A Collection of Quality Diversity Optimization Problems Derived from Hyperparameter Optimization of Machine Learning Models | Code | 0
Simitate: A Hybrid Imitation Learning Benchmark | Code | 0
Echo State Networks with Self-Normalizing Activations on the Hyper-Sphere | Code | 0
ECBD: Evidence-Centered Benchmark Design for NLP | Code | 0
A Continuous Optimisation Benchmark Suite from Neural Network Regression | Code | 0
An Evaluation of Machine Learning Approaches for Early Diagnosis of Autism Spectrum Disorder | Code | 0
Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique | Code | 0
DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning | Code | 0
DynamoRep: Trajectory-Based Population Dynamics for Classification of Black-box Optimization Problems | Code | 0
Simple GNNs with Low Rank Non-parametric Aggregators | Code | 0
Effective Stabilized Self-Training on Few-Labeled Graph Data | Code | 0
Simulated Contextual Bandits for Personalization Tasks from Recommendation Datasets | Code | 0
A Deep Reinforcement Learning Framework for Dynamic Portfolio Optimization: Evidence from China's Stock Market | Code | 0
DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs | Code | 0
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition | Code | 0
Referenced Thermodynamic Integration for Bayesian Model Selection: Application to COVID-19 Model Selection | Code | 0
Simulation-based Benchmarking for Causal Structure Learning in Gene Perturbation Experiments | Code | 0
Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation | Code | 0
OG-SPACE: Optimized Stochastic Simulation of Spatial Models of Cancer Evolution | Code | 0
Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions | Code | 0
Okapi: Generalising Better by Making Statistical Matches Match | Code | 0
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations | Code | 0
DQI: Measuring Data Quality in NLP | Code | 0
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks | Code | 0
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0
WebSuite: Systematically Evaluating Why Web Agents Fail | Code | 0
Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation | Code | 0
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification | Code | 0
Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks | Code | 0
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M | Code | 0
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs | Code | 0
A Review of Testing Object-Based Environment Perception for Safe Automated Driving | Code | 0
Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo | Code | 0
Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models | Code | 0
On dataset transferability in medical image classification | Code | 0
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | Code | 0
Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0
YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems | Code | 0
Benchmarking Long-tail Generalization with Likelihood Splits | Code | 0
UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking | Code | 0
On Empirical Comparisons of Optimizers for Deep Learning | Code | 0
Benchmarking LLMs' Judgments with No Gold Standard | Code | 0
Page 102 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified