Benchmarking

Papers

Showing 1751–1800 of 5548 papers

Title	Date	Tasks	Status	Score
Knowledge-Driven Slot Constraints for Goal-Oriented Dialogue Systems	Jun 1, 2021	Benchmarking, Goal-Oriented Dialogue Systems	Code Available	5
Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation	Jun 2, 2019	Benchmarking, Deep Reinforcement Learning	Code Available	5
Can a single neuron learn predictive uncertainty?	Jun 7, 2021	Benchmarking, Conformal Prediction	Code Available	5
JATE 2.0: Java Automatic Term Extraction with Apache Solr	May 1, 2016	Benchmarking, Term Extraction	Code Available	5
Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning	Jun 9, 2025	Benchmarking, Diagnostic	Code Available	5
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models	May 23, 2025	Benchmarking, Diversity	Code Available	5
Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs	May 29, 2025	Benchmarking, Fairness	Code Available	5
COCO: Performance Assessment	May 11, 2016	Benchmarking	Code Available	5
DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs	Apr 10, 2024	Benchmarking, knowledge editing	Code Available	5
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models	Jun 10, 2024	Benchmarking, Code Generation	Code Available	5
Analyzing the Feature Extractor Networks for Face Image Synthesis	Jun 4, 2024	Benchmarking, Image Generation	Code Available	5
Mamba-Based Ensemble learning for White Blood Cell Classification	Apr 15, 2025	Benchmarking, Classification	Code Available	5
Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted Medical Education and Decision Making in Radiation Oncology	Apr 24, 2023	Benchmarking, Decision Making	Code Available	5
JExplore: Design Space Exploration Tool for Nvidia Jetson Boards	Feb 16, 2025	Benchmarking, GPU	Code Available	5
Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model	Jul 31, 2024	Benchmarking, Large Language Model	Code Available	5
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images	Oct 22, 2024	Benchmarking, Self-Supervised Learning	Code Available	5
Calibrating Pre-trained Language Classifiers on LLM-generated Noisy Labels via Iterative Refinement	May 26, 2025	Benchmarking	Code Available	5
STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking	May 16, 2025	Benchmarking	Code Available	5
IoT Data Trust Evaluation via Machine Learning	Aug 15, 2023	Benchmarking, Time Series	Code Available	5
Calibrated Adaptive Probabilistic ODE Solvers	Dec 15, 2020	Benchmarking, Descriptive	Code Available	5
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning	Jan 8, 2025	Benchmarking	Code Available	5
IPC: A Benchmark Data Set for Learning with Graph-Structured Data	May 15, 2019	Benchmarking, Graph Classification	Code Available	5
Knowledge Enhanced Conditional Imputation for Healthcare Time-series	Dec 27, 2023	Benchmarking, Imputation	Code Available	5
Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance	Sep 22, 2024	AutoML, Benchmarking	Code Available	5
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge	Apr 10, 2025	Adversarial Robustness, Benchmarking	Code Available	5
Cable Tree Wiring -- Benchmarking Solvers on a Real-World Scheduling Problem with a Variety of Precedence Constraints	Nov 25, 2020	Benchmarking, Scheduling	Code Available	5
Inverse Contextual Bandits: Learning How Behavior Evolves over Time	Jul 13, 2021	Benchmarking, Decision Making	Code Available	5
Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM	Oct 8, 2014	Benchmarking	Code Available	5
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions	Oct 18, 2023	Benchmarking, Visual Grounding	Code Available	5
B-XAIC Dataset: Benchmarking Explainable AI for Graph Neural Networks Using Chemical Data	May 28, 2025	Benchmarking, Drug Discovery	Code Available	5
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition	Jun 10, 2024	Benchmarking, Emotion Recognition	Code Available	5
Multitask learning and benchmarking with clinical time series data	Jun 17, 2019	Benchmarking, BIG-bench Machine Learning	Code Available	5
Building Conformal Prediction Intervals with Approximate Message Passing	Oct 21, 2024	Benchmarking, Conformal Prediction	Code Available	5
Building and benchmarking an Arabic Speech Commands dataset for small-footprint keyword spotting	May 7, 2021	Benchmarking, Deep Learning	Code Available	5
Adaptive Visual Scene Understanding: Incremental Scene Graph Generation	Oct 2, 2023	Benchmarking, Continual Learning	Code Available	5
Integrating Expert Knowledge into Logical Programs via LLMs	Feb 17, 2025	Benchmarking, Logical Reasoning	Code Available	5
Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark	May 9, 2016	Benchmarking, Emotion Recognition	Code Available	5
ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance	Jan 17, 2025	Benchmarking, Multi-agent Reinforcement Learning	Code Available	5
Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models	Mar 11, 2025	Benchmarking, Hyperparameter Optimization	Code Available	5
Bugs in the Data: How ImageNet Misrepresents Biodiversity	Aug 24, 2022	Benchmarking, Object Detection	Code Available	5
CleanPatrick: A Benchmark for Image Data Cleaning	May 16, 2025	Benchmarking, Label Error Detection	Code Available	5
BubGAN: Bubble Generative Adversarial Networks for Synthesizing Realistic Bubbly Flow Images	Sep 7, 2018	Benchmarking	Code Available	5
InstaIndoor and Multi-modal Deep Learning for Indoor Scene Recognition	Dec 23, 2021	Benchmarking, Deep Learning	Code Available	5
bsnsing: A decision tree induction method based on recursive optimal boolean rule composition	May 30, 2022	Benchmarking	Code Available	5
BSBench: will your LLM find the largest prime number?	Jun 5, 2025	Benchmarking	Code Available	5
Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories	Apr 10, 2025	Additive models, Benchmarking	Code Available	5
inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences	Jun 25, 2025	Benchmarking	Code Available	5
Towards Learning Universal, Regional, and Local Hydrological Behaviors via Machine-Learning Applied to Large-Sample Datasets	Jul 19, 2019	Benchmarking, BIG-bench Machine Learning	Code Available	5
Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation	Apr 29, 2025	Benchmarking, Fairness	Code Available	5
Adaptive Power System Emergency Control using Deep Reinforcement Learning	Mar 9, 2019	Benchmarking, Deep Reinforcement Learning	Code Available	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified