Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 4101–4125 of 5548 papers

Title	Date	Tasks	Status
The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests	Sep 22, 2024	Benchmarking	—Unverified
The ACL RD-TEC: A Dataset for Benchmarking Terminology Extraction and Classification in Computational Linguistics	Aug 1, 2014	BenchmarkingGeneral Classification	—Unverified
The Adversarial AI-Art: Understanding, Generation, Detection, and Benchmarking	Apr 22, 2024	BenchmarkingMisinformation	—Unverified
The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artificial Intelligence	May 14, 2019	Benchmarkingspeech-recognition	—Unverified
Language Models as a Service: Overview of a New Paradigm and its Challenges	Sep 28, 2023	Benchmarking	—Unverified
The Benchmark Lottery	Jul 14, 2021	BenchmarkingBIG-bench Machine Learning	—Unverified
The Brain Tumor Segmentation (BraTS-METS) Challenge 2023: Brain Metastasis Segmentation on Pre-treatment MRI	Jun 1, 2023	BenchmarkingBrain Tumor Segmentation	—Unverified
The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal	Sep 12, 2024	BenchmarkingLanguage Modeling	—Unverified
The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach	Apr 27, 2025	BenchmarkingDecision Making	—Unverified
The Curious Case of Integrator Reach Sets, Part I: Basic Theory	Feb 23, 2021	Benchmarking	—Unverified
The Design and Implementation of a Scalable DL Benchmarking Platform	Nov 19, 2019	Benchmarking	—Unverified
The Disagreement Problem in Faithfulness Metrics	Nov 13, 2023	BenchmarkingExplainable artificial intelligence	—Unverified
The DLV System for Knowledge Representation and Reasoning	Nov 4, 2002	Benchmarking	—Unverified
The Dota 2 Bot Competition	Mar 4, 2021	BenchmarkingDota 2	—Unverified
The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation	Aug 1, 2021	BenchmarkingMachine Translation	—Unverified
The EuroCity Persons Dataset: A Novel Benchmark for Object Detection	May 18, 2018	BenchmarkingObject	—Unverified
The Evolutionary Computation Methods No One Should Use	Jan 5, 2023	Benchmarking	—Unverified
The Expressive Power of Word Embeddings	Jan 15, 2013	BenchmarkingSentence	—Unverified
The Extractive-Abstractive Axis: Measuring Content "Borrowing" in Generative Language Models	Jul 20, 2023	Benchmarking	—Unverified
The FaceChannelS: Strike of the Sequences for the AffWild 2 Challenge	Oct 4, 2020	BenchmarkingBIG-bench Machine Learning	—Unverified
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input	Jan 6, 2025	BenchmarkingForm	—Unverified
The Forchheim Image Database for Camera Identification in the Wild	Nov 4, 2020	BenchmarkingFact Checking	—Unverified
The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech	Apr 17, 2021	Benchmarking	—Unverified
The Impact of Genomic Variation on Function (IGVF) Consortium	Jul 24, 2023	Benchmarking	—Unverified
The iNaturalist Sounds Dataset	May 31, 2025	Benchmarking	—Unverified

Show:10 25 50

← PrevPage 165 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified