Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1451–1475 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Kvasir-Instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy	Oct 23, 2020	BenchmarkingDiagnostic	CodeCode Available	1	5
Just Rank: Rethinking Evaluation with Word and Sentence Similarities	Mar 5, 2022	BenchmarkingSemantic Similarity	CodeCode Available	1	5
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging	Jun 6, 2025	Benchmarking	CodeCode Available	1	5
"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters	Oct 13, 2023	BenchmarkingFairness	CodeCode Available	1	5
Beyond neural scaling laws: beating power law scaling via data pruning	Jun 29, 2022	Benchmarking	CodeCode Available	1	5
Bag of Tricks for Adversarial Training	Oct 1, 2020	Adversarial RobustnessBenchmarking	CodeCode Available	1	5
BEND: Benchmarking DNA Language Models on biologically meaningful tasks	Nov 21, 2023	BenchmarkingLanguage Modeling	CodeCode Available	1	5
Leveraging Trust for Joint Multi-Objective and Multi-Fidelity Optimization	Dec 27, 2021	Bayesian OptimizationBenchmarking	CodeCode Available	1	5
Beyond Normal: On the Evaluation of Mutual Information Estimators	Jun 19, 2023	BenchmarkingDomain Generalization	CodeCode Available	1	5
Experimental Validation of Ultrasound Beamforming with End-to-End Deep Learning for Single Plane Wave Imaging	Apr 22, 2024	Benchmarking	CodeCode Available	1	5
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration	May 25, 2023	BenchmarkingFace Recognition	CodeCode Available	1	5
RobustBench: a standardized adversarial robustness benchmark	Oct 19, 2020	Adversarial RobustnessBenchmarking	CodeCode Available	1	5
Benchmarking Graph Neural Networks on Dynamic Link Prediction	Sep 29, 2021	BenchmarkingDynamic Link Prediction	CodeCode Available	1	5
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages	Mar 11, 2024	BenchmarkingData Augmentation	CodeCode Available	1	5
Benchmarking Graph Neural Networks for FMRI analysis	Nov 16, 2022	Benchmarking	CodeCode Available	1	5
Exploring Large Language Models for Classical Philology	May 23, 2023	BenchmarkingDecoder	CodeCode Available	1	5
EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions	Jun 8, 2021	Bayesian OptimisationBenchmarking	CodeCode Available	1	5
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization	Apr 6, 2025	BenchmarkingCombinatorial Optimization	CodeCode Available	1	5
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models	Jul 16, 2024	BenchmarkingCode Generation	CodeCode Available	1	5
Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts	Nov 7, 2023	BenchmarkingMachine Translation	CodeCode Available	1	5
RobFR: Benchmarking Adversarial Robustness on Face Recognition	Jul 8, 2020	Adversarial RobustnessBenchmarking	CodeCode Available	1	5
Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems	Jun 28, 2021	3D ReconstructionBenchmarking	CodeCode Available	1	5
MatTools: Benchmarking Large Language Models for Materials Science Tools	May 16, 2025	BenchmarkingQuestion Answering	CodeCode Available	1	5
FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods	Jun 15, 2023	BenchmarkingFairness	CodeCode Available	1	5
Benchmarking Knowledge-driven Zero-shot Learning	Jun 29, 2021	AttributeBenchmarking	CodeCode Available	1	5

Show:10 25 50

← PrevPage 59 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified