SOTAVerified

Benchmarking

Papers

Showing 5076–5100 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation | Code | 0 |
| OG-SPACE: Optimized Stochastic Simulation of Spatial Models of Cancer Evolution | Code | 0 |
| Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions | Code | 0 |
| Okapi: Generalising Better by Making Statistical Matches Match | Code | 0 |
| DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations | Code | 0 |
| DQI: Measuring Data Quality in NLP | Code | 0 |
| ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks | Code | 0 |
| Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction | Code | 0 |
| WebSuite: Systematically Evaluating Why Web Agents Fail | Code | 0 |
| Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation | Code | 0 |
| Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification | Code | 0 |
| Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks | Code | 0 |
| Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M | Code | 0 |
| Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs | Code | 0 |
| A Review of Testing Object-Based Environment Perception for Safe Automated Driving | Code | 0 |
| Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo | Code | 0 |
| Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models | Code | 0 |
| On dataset transferability in medical image classification | Code | 0 |
| Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | Code | 0 |
| Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0 |
| YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems | Code | 0 |
| Benchmarking Long-tail Generalization with Likelihood Splits | Code | 0 |
| UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking | Code | 0 |
| On Empirical Comparisons of Optimizers for Deep Learning | Code | 0 |
| Benchmarking LLMs' Judgments with No Gold Standard | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |