Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1151–1175 of 5548 papers

Title	Date	Tasks	Status	Hype
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling	Jun 10, 2025	Benchmarking	CodeCode Available	1
dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing	Apr 27, 2021	BenchmarkingRetrieval	CodeCode Available	1
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs	Feb 13, 2025	BenchmarkingRetrieval	CodeCode Available	1
Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System	Nov 1, 2021	BenchmarkingResponse Generation	CodeCode Available	1
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data	Feb 27, 2024	Benchmarking	CodeCode Available	1
Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark	Jun 8, 2022	BenchmarkingExplainable Artificial Intelligence (XAI)	CodeCode Available	1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch	Mar 25, 2024	Benchmarking	CodeCode Available	1
A Closer Look at Mortality Risk Prediction from Electrocardiograms	Jun 24, 2024	BenchmarkingPrediction	CodeCode Available	1
Benchmarking MRI Reconstruction Neural Networks on Large Public Datasets	Mar 6, 2020	BenchmarkingImage Reconstruction	CodeCode Available	1
A global analysis of metrics used for measuring performance in natural language processing	Apr 25, 2022	BenchmarkingMachine Translation	CodeCode Available	1
A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models	Mar 31, 2023	BenchmarkingCausal Discovery	CodeCode Available	1
Benchmarking Large Language Models for News Summarization	Jan 31, 2023	BenchmarkingNews Summarization	CodeCode Available	1
A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging	Apr 26, 2020	BenchmarkingLeft Atrium Segmentation	CodeCode Available	1
Benchmarking Multidomain English-Indonesian Machine Translation	May 1, 2020	BenchmarkingMachine Translation	CodeCode Available	1
CODEMENV: Benchmarking Large Language Models on Code Migration	Jun 1, 2025	Benchmarking	CodeCode Available	1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking	Jan 22, 2020	Benchmarkingobject-detection	CodeCode Available	1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates	Jul 8, 2024	Benchmarkingknowledge editing	CodeCode Available	1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework	Dec 7, 2022	Benchmarking	CodeCode Available	1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite	Mar 15, 2019	Benchmarking	CodeCode Available	1
Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models	May 26, 2025	BenchmarkingRAG	CodeCode Available	1
A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization	Aug 10, 2023	BenchmarkingDecision Making	CodeCode Available	1
EDFace-Celeb-1M: Benchmarking Face Hallucination with a Million-scale Dataset	Oct 11, 2021	BenchmarkingFace Hallucination	CodeCode Available	1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization	Apr 6, 2025	BenchmarkingCombinatorial Optimization	CodeCode Available	1
Benchmarking and scaling of deep learning models for land cover image classification	Nov 18, 2021	BenchmarkingClassification	CodeCode Available	1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform	Oct 12, 2021	Benchmarking	CodeCode Available	1

Show:10 25 50

← PrevPage 47 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified