Benchmarking

Papers

Title	Date	Tasks	Status
Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation	Sep 24, 2024	Benchmarking, Movie Recommendation	Code Available
OG-SPACE: Optimized Stochastic Simulation of Spatial Models of Cancer Evolution	Oct 13, 2021	Benchmarking	Code Available
Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions	May 16, 2024	Benchmarking, Reinforcement Learning (RL)	Code Available
Okapi: Generalising Better by Making Statistical Matches Match	Nov 7, 2022	Benchmarking, Binary Classification	Code Available
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations	Jan 24, 2022	Benchmarking, Drug Discovery	Code Available
DQI: Measuring Data Quality in NLP	May 2, 2020	Active Learning, Benchmarking	Code Available
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks	Jan 29, 2024	Benchmarking, Cross-Lingual Transfer	Code Available
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction	May 23, 2023	Aspect-Based Sentiment Analysis, Aspect-Based Sentiment Analysis (ABSA)	Code Available
WebSuite: Systematically Evaluating Why Web Agents Fail	Jun 1, 2024	Benchmarking, Diagnostic	Code Available
Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation	Jul 17, 2020	Benchmarking, Disentanglement	Code Available
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification	Jul 18, 2022	Benchmarking, BIG-bench Machine Learning	Code Available
Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks	Nov 15, 2023	Benchmarking, Network Pruning	Code Available
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M	May 15, 2025	Benchmarking, Memorization	Code Available
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs	Apr 7, 2025	Benchmarking, Fairness	Code Available
A Review of Testing Object-Based Environment Perception for Safe Automated Driving	Feb 16, 2021	Benchmarking, Sensor Modeling	Code Available
Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo	May 3, 2024	Benchmarking, Multi-hop Question Answering	Code Available
Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models	Feb 21, 2025	Benchmarking, Diagnostic	Code Available
On dataset transferability in medical image classification	Dec 28, 2024	Benchmarking, Classification	Code Available
Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions?	May 7, 2025	Benchmarking, Semantic Segmentation	Code Available
Do LLM Evaluators Prefer Themselves for a Reason?	Apr 4, 2025	Benchmarking, Code Generation	Code Available
YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems	Jul 26, 2023	Benchmarking, CPU	Code Available
Benchmarking Long-tail Generalization with Likelihood Splits	Oct 13, 2022	Benchmarking, Language Modeling	Code Available
UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking	May 21, 2025	Benchmarking, Claim Verification	Code Available
On Empirical Comparisons of Optimizers for Deep Learning	Oct 11, 2019	Benchmarking, Deep Learning	Code Available
Benchmarking LLMs' Judgments with No Gold Standard	Nov 11, 2024	Benchmarking, Machine Translation	Code Available

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified