SOTAVerified

Benchmarking

Papers

Showing 451–500 of 5548 papers

Title | Status | Hype
An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Code | 1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1
Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1
AdsorbML: A Leap in Efficiency for Adsorption Energy Calculations using Generalizable Machine Learning Potentials | Code | 1
ClearPose: Large-scale Transparent Object Dataset and Benchmark | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 | Code | 1
An Exploration of Embodied Visual Exploration | Code | 1
AnomalyHop: An SSL-based Image Anomaly Localization Method | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
CharacterBench: Benchmarking Character Customization of Large Language Models | Code | 1
On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Code | 1
CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Code | 1
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
CCTV-Gun: Benchmarking Handgun Detection in CCTV Images | Code | 1
Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? | Code | 1
CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Accelerated and interpretable oblique random survival forests | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
Benchmarking Visual Localization for Autonomous Navigation | Code | 1
CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs | Code | 1
Chaos as an interpretable benchmark for forecasting and data-driven modelling | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark | Code | 1
AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring | Code | 1
Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1
Restore Anything Model via Efficient Degradation Adaptation | Code | 1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1
Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1
CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Code | 1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning | Code | 1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1
A Platform for the Biomedical Application of Large Language Models | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk | Code | 1
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Code | 1
Page 10 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified