Benchmarking

Papers

Showing 2451–2475 of 5548 papers

Title	Date	Tasks	Status	Score
Strong and Simple Baselines for Multimodal Utterance Embeddings	May 14, 2019	Benchmarking	Code Available	5
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams	Jun 17, 2024	All, Benchmarking	Code Available	5
DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models	Jun 8, 2023	Benchmarking, Fairness	Code Available	5
Benchmarking Large Language Models for Math Reasoning Tasks	Aug 20, 2024	Benchmarking, In-Context Learning	Code Available	5
Benchmarking Large Language Models for Image Classification of Marine Mammals	Oct 22, 2024	Benchmarking, image-classification	Code Available	5
Flexible Generation of Preference Data for Recommendation Analysis	Jul 23, 2024	Benchmarking, Recommendation Systems	Code Available	5
Divergent Creativity in Humans and Large Language Models	May 13, 2024	Benchmarking	Code Available	5
Local manifold learning and its link to domain-based physics knowledge	Jul 1, 2022	Benchmarking, Dimensionality Reduction	Code Available	5
Distributional Depth-Based Estimation of Object Articulation Models	Aug 12, 2021	Benchmarking, Object	Code Available	5
Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation	Oct 29, 2021	Benchmarking, Brain Tumor Segmentation	Code Available	5
A Framework for Generating Informative Benchmark Instances	May 29, 2022	Benchmarking	Code Available	5
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search	Jan 26, 2025	Benchmarking, Diversity	Code Available	5
A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice	Dec 20, 2024	Benchmarking, Diagnostic	Code Available	5
Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability	Feb 18, 2020	Benchmarking, Federated Learning	Code Available	5
Generalization and Regularization in DQN	Sep 29, 2018	Atari Games, Benchmarking	Code Available	5
Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI	Mar 7, 2024	Benchmarking	Code Available	5
exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem	Feb 11, 2025	Benchmarking, Diversity	Code Available	5
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions	Aug 2, 2024	Benchmarking, multimodal interaction	Code Available	5
Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection	Aug 22, 2023	Benchmarking, Out-of-Distribution Detection	Code Available	5
Experimental Analysis of Large-scale Learnable Vector Storage Compression	Nov 27, 2023	Benchmarking	Code Available	5
Benchmarking Large Language Models for Molecule Prediction Tasks	Mar 8, 2024	Benchmarking, Prediction	Code Available	5
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions	May 8, 2025	Autonomous Navigation, Benchmarking	Code Available	5
Are Large Language Models Good at Utility Judgments?	Mar 28, 2024	Answer Generation, Benchmarking	Code Available	5
DispaRisk: Auditing Fairness Through Usable Information	May 20, 2024	Benchmarking, Bias Detection	Code Available	5
GenderBench: Evaluation Suite for Gender Biases in LLMs	May 17, 2025	Benchmarking	Code Available	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified