Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 2651–2675 of 5548 papers

Title	Date	Tasks	Status
Benchmarking Quality-Diversity Algorithms on Neuroevolution for Reinforcement Learning	Nov 4, 2022	BenchmarkingDiversity	—Unverified
Benchmarking Quality-Dependent and Cost-Sensitive Score-Level Multimodal Biometric Fusion Algorithms	Nov 17, 2021	Benchmarking	—Unverified
Foundations for learning from noisy quantum experiments	Apr 28, 2022	Benchmarking	—Unverified
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate	May 28, 2025	Benchmarking	—Unverified
FarsBase-KBP: A Knowledge Base Population System for the Persian Knowledge Graph	May 4, 2020	BenchmarkingEntity Linking	—Unverified
Fantastic Questions and Where to Find Them: FairytaleQA – An Authentic Dataset for Narrative Comprehension	May 1, 2022	BenchmarkingQuestion Answering	—Unverified
AI PERSONA: Towards Life-long Personalization of LLMs	Dec 17, 2024	Benchmarking	—Unverified
Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension	Nov 16, 2021	BenchmarkingQuestion Answering	—Unverified
FRED: The Florence RGB-Event Drone Dataset	Jun 5, 2025	BenchmarkingTrajectory Forecasting	—Unverified
Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models	Feb 9, 2025	BenchmarkingCode Generation	—Unverified
Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification	May 24, 2024	BenchmarkingData Augmentation	—Unverified
Benchmarking Single-Image Reflection Removal Algorithms	Oct 1, 2017	BenchmarkingReflection Removal	—Unverified
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning	May 12, 2025	16kBenchmarking	—Unverified
From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano	Jul 5, 2024	AttributeBenchmarking	—Unverified
Benchmarking projective simulation in navigation problems	Apr 23, 2018	BenchmarkingQ-Learning	—Unverified
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems	Oct 24, 2024	BenchmarkingCommon Sense Reasoning	—Unverified
A Survey on LLM-based News Recommender Systems	Feb 13, 2025	BenchmarkingFairness	—Unverified
From Code to Play: Benchmarking Program Search for Games Using Large Language Models	Dec 5, 2024	Atari GamesBenchmarking	—Unverified
From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks	Apr 14, 2022	Adversarial AttackAdversarial Robustness	—Unverified
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT	May 17, 2024	BenchmarkingMultiple-choice	—Unverified
Holistic Multi-View Building Analysis in the Wild with Projection Pooling	Aug 23, 2020	Benchmarking	—Unverified
How Aligned are Different Alignment Metrics?	Jul 10, 2024	Benchmarking	—Unverified
A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images	Feb 27, 2024	BenchmarkingDefect Detection	—Unverified
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future	Aug 5, 2024	BenchmarkingCode Generation	—Unverified
Benchmarking Processor Performance by Multi-Threaded Machine Learning Algorithms	Sep 11, 2021	BenchmarkingBIG-bench Machine Learning	—Unverified

Show:10 25 50

← PrevPage 107 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified