Benchmarking

Papers

Showing 1101–1125 of 5548 papers

Title	Date	Tasks	Status	Hype	Score
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning	Dec 11, 2024	Attribute, Benchmarking	Code Available	1	5
Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering	May 22, 2025	Benchmarking, Evidence Selection	Code Available	1	5
2.5D Visual Relationship Detection	Apr 26, 2021	Benchmarking, Depth Estimation	Code Available	1	5
Benchmarking Robustness to Adversarial Image Obfuscations	Jan 30, 2023	Benchmarking	Code Available	1	5
Benchmarking Robustness of Text-Image Composed Retrieval	Nov 24, 2023	Attribute, Benchmarking	Code Available	1	5
Benchmarking Robustness of Machine Reading Comprehension Models	Apr 29, 2020	Benchmarking, Machine Reading Comprehension	Code Available	1	5
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code	Jun 22, 2022	Benchmarking, Text Generation	Code Available	1	5
Benchmarking saliency methods for chest X-ray interpretation	Oct 10, 2022	Benchmarking, Decision Making	Code Available	1	5
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset	Jun 5, 2023	Benchmarking, Multiple-choice	Code Available	1	5
African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification	Jun 20, 2024	Benchmarking, Classification	Code Available	1	5
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift	Dec 15, 2022	Benchmarking, Image Captioning	Code Available	1	5
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets	Dec 9, 2022	Benchmarking, Classification	Code Available	1	5
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards	May 7, 2025	Benchmarking, Hallucination	Code Available	1	5
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing	Mar 2, 2024	Attribute, Benchmarking	Code Available	1	5
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions	Feb 28, 2024	Benchmarking, Multiple-choice	Code Available	1	5
GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras	Jun 11, 2025	Benchmarking	Code Available	1	5
Benchmarking Simulation-Based Inference	Jan 12, 2021	Benchmarking	Code Available	1	5
Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences	May 28, 2024	Benchmarking, Feature Engineering	Code Available	1	5
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling	Jul 13, 2024	Benchmarking, Math	Code Available	1	5
Benchmarking LLMs for Political Science: A United Nations Perspective	Feb 19, 2025	Benchmarking, Decision Making	Code Available	1	5
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions	Jan 1, 2024	Benchmarking, Instruction Following	Code Available	1	5
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph	May 23, 2025	Benchmarking, Management	Code Available	1	5
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering	May 25, 2025	Anatomy, Benchmarking	Code Available	1	5
Grad DFT: a software library for machine learning enhanced density functional theory	Sep 23, 2023	Benchmarking	Code Available	1	5
Benchmarking the Performance of Bayesian Optimization across Multiple Experimental Materials Science Domains	May 23, 2021	Active Learning, Bayesian Optimisation	Code Available	1	5

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified