Benchmarking

Papers

Showing 2651–2700 of 5548 papers

Title	Date	Tasks	Status
Fantastic Questions and Where to Find Them: FairytaleQA – An Authentic Dataset for Narrative Comprehension	May 1, 2022	Benchmarking, Question Answering	Unverified
AI PERSONA: Towards Life-long Personalization of LLMs	Dec 17, 2024	Benchmarking	Unverified
Foundations for learning from noisy quantum experiments	Apr 28, 2022	Benchmarking	Unverified
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate	May 28, 2025	Benchmarking	Unverified
Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension	Nov 16, 2021	Benchmarking, Question Answering	Unverified
Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models	Feb 9, 2025	Benchmarking, Code Generation	Unverified
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning	May 12, 2025	16k, Benchmarking	Unverified
Framework and Benchmarks for Combinatorial and Mixed-variable Bayesian Optimization	Jun 16, 2023	Bayesian Optimization, Benchmarking	Unverified
FRED: The Florence RGB-Event Drone Dataset	Jun 5, 2025	Benchmarking, Trajectory Forecasting	Unverified
Benchmarking projective simulation in navigation problems	Apr 23, 2018	Benchmarking, Q-Learning	Unverified
Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification	May 24, 2024	Benchmarking, Data Augmentation	Unverified
Benchmarking Single-Image Reflection Removal Algorithms	Oct 1, 2017	Benchmarking, Reflection Removal	Unverified
A Survey on LLM-based News Recommender Systems	Feb 13, 2025	Benchmarking, Fairness	Unverified
How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study	Dec 25, 2024	Benchmarking, Code Generation	Unverified
Human Body Shape Classification Based on a Single Image	May 29, 2023	Benchmarking, Classification	Unverified
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems	Oct 24, 2024	Benchmarking, Common Sense Reasoning	Unverified
Benchmarking SMT Performance for Farsi Using the TEP++ Corpus	May 1, 2015	Benchmarking, Machine Translation	Unverified
From Code to Play: Benchmarking Program Search for Games Using Large Language Models	Dec 5, 2024	Atari Games, Benchmarking	Unverified
From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks	Apr 14, 2022	Adversarial Attack, Adversarial Robustness	Unverified
From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT	May 17, 2024	Benchmarking, Multiple-choice	Unverified
Benchmarking Processor Performance by Multi-Threaded Machine Learning Algorithms	Sep 11, 2021	Benchmarking, BIG-bench Machine Learning	Unverified
FakeWatch ElectionShield: A Benchmarking Framework to Detect Fake News for Credible US Elections	Nov 27, 2023	Articles, Benchmarking	Unverified
A Large-scale Evaluation of Pretraining Paradigms for the Detection of Defects in Electroluminescence Solar Cell Images	Feb 27, 2024	Benchmarking, Defect Detection	Unverified
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future	Aug 5, 2024	Benchmarking, Code Generation	Unverified
How Good is a Video Summary? A New Benchmarking Dataset and Evaluation Framework Towards Realistic Video Summarization	Jan 26, 2021	Benchmarking, Supervised Video Summarization	Unverified
Benchmarking Spiking Neural Network Learning Methods with Varying Locality	Feb 1, 2024	Benchmarking	Unverified
Fairness Index Measures to Evaluate Bias in Biometric Recognition	Jun 19, 2023	Benchmarking, Fairness	Unverified
Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making	Jun 25, 2024	Benchmarking, Decision Making	Unverified
Fairness-Aware Graph Neural Networks: A Survey	Jul 8, 2023	Benchmarking, Fairness	Unverified
From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning Revolution	Apr 9, 2024	Benchmarking	Unverified
Benchmarking State-of-the-Art Deep Learning Software Tools	Aug 25, 2016	Benchmarking, CPU	Unverified
From Sound Representation to Model Robustness	Jul 27, 2020	Adversarial Attack, Adversarial Robustness	Unverified
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs	Oct 25, 2024	Benchmarking, Fairness	Unverified
Benchmarking state-of-the-art gradient boosting algorithms for classification	May 26, 2023	Bayesian Optimization, Benchmarking	Unverified
Benchmarking Pretrained Vision Embeddings for Near- and Duplicate Detection in Medical Images	Dec 12, 2023	Benchmarking, Retrieval	Unverified
FSD-10: A Dataset for Competitive Sports Content Analysis	Feb 9, 2020	Action Recognition, Benchmarking	Unverified
FAIRification of MLC data	Nov 23, 2022	Benchmarking, Management	Unverified
A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking	Sep 5, 2023	Benchmarking, Knowledge Distillation	Unverified
How Good Is Neural Combinatorial Optimization? A Systematic Evaluation on the Traveling Salesman Problem	Sep 22, 2022	Benchmarking, Combinatorial Optimization	Unverified
Full-scale modal testing of a Hawk T1A aircraft for benchmarking vibration-based methods	Oct 6, 2023	Benchmarking, Experimental Design	Unverified
Full-stack evaluation of Machine Learning inference workloads for RISC-V systems	May 24, 2024	Benchmarking, Deep Learning	Unverified
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models	Feb 21, 2024	Benchmarking, Image to text	Unverified
FunBench: Benchmarking Fundus Reading Skills of MLLMs	Mar 2, 2025	Anatomy, Benchmarking	Unverified
Functional Code Building Genetic Programming	Jun 9, 2022	Benchmarking, Program Synthesis	Unverified
Efficient Pauli channel estimation with logarithmic quantum memory	Sep 25, 2023	Benchmarking	Unverified
A Normative Framework for Benchmarking Consumer Fairness in Large Language Model Recommender System	May 3, 2024	Benchmarking, Collaborative Filtering	Unverified
FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage	Oct 23, 2024	Benchmarking	Unverified
Fuzzy Knowledge Distillation from High-Order TSK to Low-Order TSK	Feb 16, 2023	Benchmarking, Knowledge Distillation	Unverified
A Survey of Spanish Clinical Language Models	Aug 4, 2023	Benchmarking, Survey	Unverified
AI Matrix - Synthetic Benchmarks for DNN	Nov 27, 2018	Benchmarking, CPU	Unverified

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified