Benchmarking

Papers

Showing 1751–1800 of 5548 papers

Title	Date	Tasks	Status
DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs	Mar 20, 2025	Benchmarking, Hallucination	Unverified
CMOS based image cytometry for detection of phytoplankton in ballast water	Nov 21, 2016	Benchmarking	Unverified
Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment	Aug 6, 2019	Atari Games, Benchmarking	Unverified
Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics	Feb 18, 2025	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
CityLearn v2: Energy-flexible, resilient, occupant-centric, and carbon-aware management of grid-interactive communities	May 2, 2024	Benchmarking, Management	Unverified
Addressing the Real-world Class Imbalance Problem in Dermatology	Oct 9, 2020	Benchmarking, Few-Shot Learning	Unverified
CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry	Jan 26, 2025	Benchmarking, Object Detection	Unverified
A new dataset of dog breed images and a benchmark for fine-grained classification	Oct 1, 2020	Benchmarking, Classification	Unverified
Benchmarking Automated Review Response Generation for the Hospitality Domain	Dec 1, 2020	Benchmarking, Domain Adaptation	Unverified
Does AI for science need another ImageNet Or totally different benchmarks? A case study of machine learning force fields	Aug 11, 2023	Benchmarking	Unverified
Benchmarking Automated Machine Learning Methods for Price Forecasting Applications	Apr 28, 2023	AutoML, Benchmarking	Unverified
CIMLA: Interpretable AI for inference of differential causal networks	Apr 25, 2023	Benchmarking	Unverified
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis	Mar 29, 2025	Benchmarking, Large Language Model	Unverified
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance	Jul 14, 2025	Benchmarking, Code Generation	Unverified
CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis	Oct 6, 2023	Benchmarking, Domain Generalization	Unverified
CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations	Apr 19, 2025	Benchmarking	Unverified
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings	Jan 2, 2025	Benchmarking, Code Generation	Unverified
Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos	Jan 1, 2024	Benchmarking	Unverified
Benchmarking Audio Deepfake Detection Robustness in Real-world Communication Scenarios	Apr 16, 2025	Audio Deepfake Detection, Benchmarking	Unverified
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks	Jul 14, 2025	Benchmarking, Code Generation	Unverified
CI-Bench: Benchmarking Contextual Integrity of AI Assistants on Synthetic Data	Sep 20, 2024	Benchmarking, Language Modeling	Unverified
Benchmarking Attention Mechanisms and Consistency Regularization Semi-Supervised Learning for Post-Flood Building Damage Assessment in Satellite Images	Dec 4, 2024	Benchmarking, Building Damage Assessment	Unverified
An Empirical Study of Training State-of-the-Art LiDAR Segmentation Models	May 23, 2024	Autonomous Driving, Benchmarking	Unverified
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs	Sep 9, 2024	Benchmarking, knowledge editing	Unverified
DLUE: Benchmarking Document Language Understanding	May 16, 2023	Benchmarking, Document Classification	Unverified
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools	Jan 1, 2025	Benchmarking	Unverified
Benchmarking ASR Systems Based on Post-Editing Effort and Error Analysis	Jul 1, 2021	Benchmarking	Unverified
CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices	Jan 1, 2025	Benchmarking	Unverified
LAraBench: Benchmarking Arabic AI with Large Language Models	May 24, 2023	Benchmarking, Few-Shot Learning	Unverified
Cognitive Model Priors for Predicting Human Decisions	May 22, 2019	Benchmarking, BIG-bench Machine Learning	Unverified
Coherent Feed Forward Quantum Neural Network	Feb 1, 2024	Benchmarking, Diagnostic	Unverified
Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks	Apr 30, 2020	Benchmarking, Coherence Evaluation	Unverified
ChemTime: Rapid and Early Classification for Multivariate Time Series Classification of Chemical Sensors	Dec 15, 2023	Benchmarking, Classification	Unverified
An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition	Oct 16, 2023	Benchmarking, Micro Expression Recognition	Unverified
Diverse Community Data for Benchmarking Data Privacy Algorithms	Jun 20, 2023	Benchmarking	Unverified
ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models	May 18, 2025	Articles, Benchmarking	Unverified
An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction	Nov 3, 2023	Benchmarking, Sentence	Unverified
Colonoscopy 3D Video Dataset with Paired Depth from 2D-3D Registration	Jun 17, 2022	Benchmarking, Depth Estimation	Unverified
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance	Aug 4, 2024	Action Anticipation, Benchmarking	Unverified
ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task	Apr 27, 2023	Articles, Benchmarking	Unverified
Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics	Apr 21, 2022	Attribute, Benchmarking	Unverified
Distribution-Based Invariant Deep Networks for Learning Meta-Features	Jun 24, 2020	Benchmarking, General Classification	Unverified
Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories	Nov 7, 2022	3D Reconstruction, 4D reconstruction	Unverified
Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics	Sep 17, 2021	Attribute, Benchmarking	Unverified
ChatGPT Alternative Solutions: Large Language Models Survey	Mar 21, 2024	Benchmarking, Chatbot	Unverified
Commute Graph Neural Networks	Jun 30, 2024	Benchmarking	Unverified
An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets	Dec 2, 2023	Benchmarking	Unverified
Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts	May 23, 2025	Benchmarking	Unverified
Distributed Training Large-Scale Deep Architectures	Aug 10, 2017	Benchmarking, Deep Learning	Unverified
Sensitivity analysis and experimental evaluation of PID-like continuous sliding mode control	Aug 13, 2022	Benchmarking, Sensitivity	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified