Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 1051–1100 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing	Mar 2, 2024	AttributeBenchmarking	CodeCode Available	1	5
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios	Oct 25, 2024	BenchmarkingDiversity	CodeCode Available	1	5
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning	Sep 27, 2024	AutoMLBenchmarking	CodeCode Available	1	5
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks	Mar 23, 2025	BenchmarkingHallucination	CodeCode Available	1	5
Benchmarking saliency methods for chest X-ray interpretation	Oct 10, 2022	BenchmarkingDecision Making	CodeCode Available	1	5
3D Common Corruptions and Data Augmentation	Mar 2, 2022	BenchmarkingData Augmentation	CodeCode Available	1	5
GenISP: Neural ISP for Low-Light Machine Cognition	May 7, 2022	BenchmarkingImage Restoration	CodeCode Available	1	5
AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents	Apr 9, 2024	Benchmarking	CodeCode Available	1	5
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets	Dec 9, 2022	BenchmarkingClassification	CodeCode Available	1	5
Geoclidean: Few-Shot Generalization in Euclidean Geometry	Nov 30, 2022	Benchmarking	CodeCode Available	1	5
Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform	Jul 15, 2020	ArticlesBenchmarking	CodeCode Available	1	5
Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous graph neural networks	Dec 30, 2021	BenchmarkingHeterogeneous Node Classification	CodeCode Available	1	5
From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation	Mar 3, 2025	BenchmarkingComputational Efficiency	CodeCode Available	1	5
Benchmarking Robustness of Text-Image Composed Retrieval	Nov 24, 2023	AttributeBenchmarking	CodeCode Available	1	5
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios	May 22, 2025	BenchmarkingInstruction Following	CodeCode Available	1	5
Benchmarking Robustness to Adversarial Image Obfuscations	Jan 30, 2023	Benchmarking	CodeCode Available	1	5
GENEVA: Benchmarking Generalizability for Event Argument Extraction with Hundreds of Event Types and Argument Roles	May 25, 2022	BenchmarkingEvent Argument Extraction	CodeCode Available	1	5
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs	Nov 29, 2023	Benchmarking	CodeCode Available	1	5
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering	May 25, 2025	AnatomyBenchmarking	CodeCode Available	1	5
Generative Evaluation of Complex Reasoning in Large Language Models	Apr 3, 2025	BenchmarkingMemorization	CodeCode Available	1	5
Benchmarking Robustness of Machine Reading Comprehension Models	Apr 29, 2020	BenchmarkingMachine Reading Comprehension	CodeCode Available	1	5
3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding	Mar 30, 2021	Affordance DetectionBenchmarking	CodeCode Available	1	5
Generative CKM Construction using Partially Observed Data with Diffusion Model	Dec 19, 2024	Benchmarking	CodeCode Available	1	5
Generative Wind Power Curve Modeling Via Machine Vision: A Self-learning Deep Convolutional Network Based Method	Aug 19, 2021	BenchmarkingSynthetic Data Generation	CodeCode Available	1	5
GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning	Feb 3, 2024	BenchmarkingDeepFake Detection	CodeCode Available	1	5
German's Next Language Model	Oct 21, 2020	BenchmarkingDocument Classification	CodeCode Available	1	5
Benchmarking Robustness of 3D Object Detection to Common Corruptions	Jan 1, 2023	3D Object DetectionAutonomous Driving	CodeCode Available	1	5
Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering	May 22, 2025	BenchmarkingEvidence Selection	CodeCode Available	1	5
Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study	Feb 26, 2025	BenchmarkingBlood pressure estimation	CodeCode Available	1	5
A Review and Efficient Implementation of Scene Graph Generation Metrics	Apr 15, 2024	BenchmarkingGraph Generation	CodeCode Available	1	5
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models	Jun 1, 2024	Benchmarking	CodeCode Available	1	5
Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining	Nov 22, 2017	Benchmarkingfeature selection	CodeCode Available	1	5
2.5D Visual Relationship Detection	Apr 26, 2021	BenchmarkingDepth Estimation	CodeCode Available	1	5
General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design	Jun 24, 2024	BenchmarkingDrug Design	CodeCode Available	1	5
Generating a Doppelganger Graph: Resembling but Distinct	Jan 23, 2021	BenchmarkingGraph Representation Learning	CodeCode Available	1	5
GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts	Oct 12, 2023	Benchmarking	CodeCode Available	1	5
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph	May 23, 2025	BenchmarkingManagement	CodeCode Available	1	5
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code	Jun 22, 2022	BenchmarkingText Generation	CodeCode Available	1	5
GAMA: a General Automated Machine learning Assistant	Jul 9, 2020	AutoMLBenchmarking	CodeCode Available	1	5
GastroVision: A Multi-class Endoscopy Image Dataset for Computer Aided Gastrointestinal Disease Detection	Jul 16, 2023	Benchmarking	CodeCode Available	1	5
G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks	Sep 29, 2023	Benchmarking	CodeCode Available	1	5
Benchmarking Quantized Neural Networks on FPGAs with FINN	Feb 2, 2021	BenchmarkingQuantization	CodeCode Available	1	5
GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection	Jun 21, 2023	Anomaly DetectionBenchmarking	CodeCode Available	1	5
GCondenser: Benchmarking Graph Condensation	May 23, 2024	BenchmarkingGraph Representation Learning	CodeCode Available	1	5
Benchmarking emergency department triage prediction models with machine learning and large public electronic health records	Nov 22, 2021	Benchmarking	CodeCode Available	1	5
FTNet: Feature Transverse Network for Thermal Image Semantic Segmentation	Oct 26, 2021	BenchmarkingScene Segmentation	CodeCode Available	1	5
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset	Jun 5, 2023	BenchmarkingMultiple-choice	CodeCode Available	1	5
Benchmarking Large Multimodal Models against Common Corruptions	Jan 22, 2024	BenchmarkingImage to text	CodeCode Available	1	5
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification	Jun 20, 2024	BenchmarkingClassification	CodeCode Available	1	5
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow	May 23, 2025	BenchmarkingCode Generation	CodeCode Available	1	5

Show:10 25 50

← PrevPage 22 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified