Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 5401–5450 of 5548 papers

Title	Date	Tasks	Status
AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals	May 21, 2025	BenchmarkingChatbot	—Unverified
Exploration of TPUs for AI Applications	Sep 16, 2023	BenchmarkingEdge-computing	—Unverified
Exploring and Benchmarking the Planning Capabilities of Large Language Models	Jun 18, 2024	BenchmarkingIn-Context Learning	—Unverified
Exploring Capabilities of Time Series Foundation Models in Building Analytics	Oct 28, 2024	Benchmarkingenergy management	—Unverified
A Benchmarking Environment for Reinforcement Learning Based Task Oriented Dialogue Management	Nov 29, 2017	BenchmarkingDeep Reinforcement Learning	—Unverified
Exploring Continual Learning of Diffusion Models	Mar 27, 2023	BenchmarkingContinual Learning	—Unverified
Capsa: A Unified Framework for Quantifying Risk in Deep Neural Networks	Aug 1, 2023	Benchmarking	—Unverified
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era	Mar 16, 2025	BenchmarkingImage Captioning	—Unverified
Airport Capacity and Performance in Europe -- A study of transport economics, service quality and sustainability	Feb 4, 2021	Benchmarking	—Unverified
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation	Feb 10, 2025	Benchmarking	—Unverified
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment	Oct 11, 2024	BenchmarkingReinforcement Learning (RL)	—Unverified
Exploring the Adversarial Frontier: Quantifying Robustness via Adversarial Hypervolume	Mar 8, 2024	Adversarial RobustnessBenchmarking	—Unverified
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance	Jun 18, 2024	Benchmarking	—Unverified
Visual Attention on the Sun: What Do Existing Models Actually Predict?	Nov 25, 2018	BenchmarkingDeep Attention	—Unverified
Exploring Thermography Technology: A Comprehensive Facial Dataset for Face Detection, Recognition, and Emotion	May 28, 2024	BenchmarkingEmotion Recognition	—Unverified
Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs	Feb 16, 2025	Benchmarking	—Unverified
Can time series forecasting be automated? A benchmark and analysis	Jul 23, 2024	BenchmarkingDecision Making	—Unverified
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning	Jun 16, 2024	BenchmarkingMath	—Unverified
Can Machines “Learn” Halide Perovskite Crystal Formation without Accurate Physicochemical Features?	May 26, 2020	Benchmarking	—Unverified
Extended Labeled Faces in-the-Wild (ELFW): Augmenting Classes for Face Segmentation	Jun 24, 2020	BenchmarkingData Augmentation	—Unverified
Extensible Logging and Empirical Attainment Function for IOHexperimenter	Sep 28, 2021	Benchmarking	—Unverified
Extraction of clinical information from the non-invasive fetal electrocardiogram	May 27, 2016	BenchmarkingHeart Rate Variability	—Unverified
Extraction of Research Objectives, Machine Learning Model Names, and Dataset Names from Academic Papers and Analysis of Their Interrelationships Using LLM and Network Analysis	Aug 22, 2024	Benchmarking	—Unverified
ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content	Mar 13, 2025	BenchmarkingImage Generation	—Unverified
Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning	Apr 19, 2024	Benchmarkingcounterfactual	—Unverified
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates	May 28, 2025	BenchmarkingDiversity	—Unverified
Face Detection on Surveillance Images	Oct 22, 2019	BenchmarkingFace Detection	—Unverified
Face Morphing Attack Generation & Detection: A Comprehensive Survey	Nov 3, 2020	BenchmarkingFace Recognition	—Unverified
FACT: Learning Governing Abstractions Behind Integer Sequences	Sep 20, 2022	Benchmarking	—Unverified
FactLens: Benchmarking Fine-Grained Fact Verification	Nov 8, 2024	BenchmarkingFact Verification	—Unverified
Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations	Dec 23, 2024	BenchmarkingQuestion Answering	—Unverified
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets	Apr 28, 2025	ArticlesBenchmarking	—Unverified
TDDBench: A Benchmark for Training data detection	Nov 5, 2024	BenchmarkingComputational Efficiency	—Unverified
A Normative Framework for Benchmarking Consumer Fairness in Large Language Model Recommender System	May 3, 2024	BenchmarkingCollaborative Filtering	—Unverified
FAIRification of MLC data	Nov 23, 2022	BenchmarkingManagement	—Unverified
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind	May 18, 2025	BenchmarkingScene Understanding	—Unverified
FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs	Oct 25, 2024	BenchmarkingFairness	—Unverified
Fairness-Aware Graph Neural Networks: A Survey	Jul 8, 2023	BenchmarkingFairness	—Unverified
Fairness Index Measures to Evaluate Bias in Biometric Recognition	Jun 19, 2023	BenchmarkingFairness	—Unverified
TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs	May 26, 2025	BenchmarkingLarge Language Model	—Unverified
FakeWatch ElectionShield: A Benchmarking Framework to Detect Fake News for Credible US Elections	Nov 27, 2023	ArticlesBenchmarking	—Unverified
TeamTrack: A Dataset for Multi-Sport Multi-Object Tracking in Full-pitch Videos	Apr 22, 2024	BenchmarkingMulti-Object Tracking	—Unverified
Teaspoon: A comprehensive python package for topological signal processing	Oct 10, 2020	BenchmarkingTopological Data Analysis	—Unverified
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning	May 12, 2025	16kBenchmarking	—Unverified
Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension	Nov 16, 2021	BenchmarkingQuestion Answering	—Unverified
Can Language Models Serve as Text-Based World Simulators?	Jun 10, 2024	BenchmarkingDecision Making	—Unverified
Fantastic Questions and Where to Find Them: FairytaleQA – An Authentic Dataset for Narrative Comprehension	May 1, 2022	BenchmarkingQuestion Answering	—Unverified
FarsBase-KBP: A Knowledge Base Population System for the Persian Knowledge Graph	May 4, 2020	BenchmarkingEntity Linking	—Unverified
Can humans help BERT gain "confidence"?	Aug 31, 2023	BenchmarkingEEG	—Unverified
Technical report of a DMD-based Characterization Method for Vision Sensors	Mar 4, 2025	BenchmarkingDataset Generation	—Unverified

Show:10 25 50

← PrevPage 109 of 111Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified