Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1351–1400 of 5548 papers

Title	Date	Tasks	Status	Hype
A framework for benchmarking class-out-of-distribution detection and its application to ImageNet	Feb 23, 2023	BenchmarkingKnowledge Distillation	CodeCode Available	1
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation	May 17, 2025	BenchmarkingQuestion Answering	CodeCode Available	1
DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems	Oct 30, 2024	BenchmarkingManagement	CodeCode Available	1
Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification	Jul 6, 2023	BenchmarkingDomain Adaptation	CodeCode Available	1
A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games	Aug 25, 2021	BenchmarkingVideo Classification	CodeCode Available	1
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?	Sep 29, 2023	BenchmarkingKnowledge Graph Completion	CodeCode Available	1
Machine Learning Methods for Brain Network Classification: Application to Autism Diagnosis using Cortical Morphological Networks	Apr 28, 2020	BenchmarkingBIG-bench Machine Learning	CodeCode Available	1
Machine Learning with Knowledge Constraints for Process Optimization of Open-Air Perovskite Solar Cell Manufacturing	Oct 1, 2021	Bayesian OptimizationBenchmarking	CodeCode Available	1
Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets	Apr 11, 2022	Action Triplet RecognitionBenchmarking	CodeCode Available	1
DependEval: Benchmarking LLMs for Repository Dependency Understanding	Mar 9, 2025	BenchmarkingCode Generation	CodeCode Available	1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models	Apr 22, 2024	BenchmarkingWorld Knowledge	CodeCode Available	1
Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs	Sep 18, 2021	BenchmarkingComplex Query Answering	CodeCode Available	1
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA	Dec 29, 2023	AnatomyBenchmarking	CodeCode Available	1
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects	Oct 3, 2024	BenchmarkingImitation Learning	CodeCode Available	1
Curious Hierarchical Actor-Critic Reinforcement Learning	May 7, 2020	BenchmarkingHierarchical Reinforcement Learning	CodeCode Available	1
MatText: Do Language Models Need More than Text & Scale for Materials Modeling?	Jun 25, 2024	Benchmarking	CodeCode Available	1
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models	Jan 2, 2025	BenchmarkingComputer Security	CodeCode Available	1
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer	Dec 2, 2021	BenchmarkingOrdinal Classification	CodeCode Available	1
MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition	Jul 12, 2021	BenchmarkingChinese Named Entity Recognition	CodeCode Available	1
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT	Jun 13, 2024	BenchmarkingLLM-generated Text Detection	CodeCode Available	1
D2S: Document-to-Slide Generation Via Query-Based Text Summarization	May 8, 2021	BenchmarkingLong Form Question Answering	CodeCode Available	1
MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation	Sep 29, 2021	BenchmarkingPhilosophy	CodeCode Available	1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models	Jun 24, 2024	BenchmarkingData Augmentation	CodeCode Available	1
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation	Oct 11, 2024	BenchmarkingImage Segmentation	CodeCode Available	1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM	Mar 28, 2024	Benchmarking	CodeCode Available	1
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks	Oct 23, 2023	Benchmarking	CodeCode Available	1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning	Feb 22, 2024	Benchmarking	CodeCode Available	1
CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version)	Nov 19, 2022	BenchmarkingC++ code	CodeCode Available	1
Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient	Jul 3, 2020	BenchmarkingMuJoCo	CodeCode Available	1
MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts	Feb 14, 2022	Benchmarking	CodeCode Available	1
Benchmarking the Robustness of Deep Neural Networks to Common Corruptions in Digital Pathology	Jun 30, 2022	BenchmarkingDiagnostic	CodeCode Available	1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration	May 18, 2021	Benchmarking	CodeCode Available	1
Benchmarking Image Retrieval for Visual Localization	Nov 24, 2020	Autonomous DrivingBenchmarking	CodeCode Available	1
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection	May 30, 2022	3D Object DetectionAutonomous Driving	CodeCode Available	1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering	Mar 26, 2024	BenchmarkingMachine Reading Comprehension	CodeCode Available	1
MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs	Jan 1, 2023	BenchmarkingGPU	CodeCode Available	1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts	Mar 19, 2023	BenchmarkingEvent Extraction	CodeCode Available	1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents	May 26, 2025	BenchmarkingMinecraft	CodeCode Available	1
minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models	Mar 24, 2022	BenchmarkingSentence	CodeCode Available	1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions	Oct 13, 2021	BenchmarkingComputational Efficiency	CodeCode Available	1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation	Dec 26, 2019	BenchmarkingDomain Adaptation	CodeCode Available	1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets	Dec 10, 2021	Benchmarking	CodeCode Available	1
Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans	Jan 14, 2021	BenchmarkingMedical Diagnosis	CodeCode Available	1
MLLM-DataEngine: An Iterative Refinement Approach for MLLM	Aug 25, 2023	Benchmarking	CodeCode Available	1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions	Jun 26, 2025	BenchmarkingDrug Design	CodeCode Available	1
CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling	Oct 14, 2022	BenchmarkingLanguage Modeling	CodeCode Available	1
ByzFL: Research Framework for Robust Federated Learning	May 30, 2025	BenchmarkingFederated Learning	CodeCode Available	1
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning	May 30, 2024	Autonomous DrivingBenchmarking	CodeCode Available	1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks	Feb 4, 2023	Adversarial AttackAdversarial Robustness	CodeCode Available	1
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data	Jun 10, 2025	BenchmarkingData Augmentation	CodeCode Available	1

Show:10 25 50

← PrevPage 28 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified