Benchmarking

Papers


Showing 1376–1400 of 5548 papers

Title	Date	Tasks	Status	Hype
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks	Oct 23, 2023	Benchmarking	Code Available	1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning	Feb 22, 2024	Benchmarking	Code Available	1
CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version)	Nov 19, 2022	Benchmarking, C++ code	Code Available	1
Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient	Jul 3, 2020	Benchmarking, MuJoCo	Code Available	1
MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts	Feb 14, 2022	Benchmarking	Code Available	1
Benchmarking the Robustness of Deep Neural Networks to Common Corruptions in Digital Pathology	Jun 30, 2022	Benchmarking, Diagnostic	Code Available	1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration	May 18, 2021	Benchmarking	Code Available	1
Benchmarking Image Retrieval for Visual Localization	Nov 24, 2020	Autonomous Driving, Benchmarking	Code Available	1
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection	May 30, 2022	3D Object Detection, Autonomous Driving	Code Available	1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering	Mar 26, 2024	Benchmarking, Machine Reading Comprehension	Code Available	1
MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs	Jan 1, 2023	Benchmarking, GPU	Code Available	1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts	Mar 19, 2023	Benchmarking, Event Extraction	Code Available	1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents	May 26, 2025	Benchmarking, Minecraft	Code Available	1
minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models	Mar 24, 2022	Benchmarking, Sentence	Code Available	1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions	Oct 13, 2021	Benchmarking, Computational Efficiency	Code Available	1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation	Dec 26, 2019	Benchmarking, Domain Adaptation	Code Available	1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets	Dec 10, 2021	Benchmarking	Code Available	1
Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans	Jan 14, 2021	Benchmarking, Medical Diagnosis	Code Available	1
MLLM-DataEngine: An Iterative Refinement Approach for MLLM	Aug 25, 2023	Benchmarking	Code Available	1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions	Jun 26, 2025	Benchmarking, Drug Design	Code Available	1
CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling	Oct 14, 2022	Benchmarking, Language Modeling	Code Available	1
ByzFL: Research Framework for Robust Federated Learning	May 30, 2025	Benchmarking, Federated Learning	Code Available	1
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning	May 30, 2024	Autonomous Driving, Benchmarking	Code Available	1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks	Feb 4, 2023	Adversarial Attack, Adversarial Robustness	Code Available	1
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data	Jun 10, 2025	Benchmarking, Data Augmentation	Code Available	1


Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified