Benchmarking

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 4351–4375 of 5548 papers

Title	Date	Tasks	Status
When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks	Apr 2, 2025	BenchmarkingLanguage Modeling	—Unverified
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques	May 22, 2025	Benchmarking	—Unverified
Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding	May 25, 2025	BenchmarkingMulti-Agent Path Finding	—Unverified
Which models are innately best at uncertainty estimation?	Jun 5, 2022	BenchmarkingOut-of-Distribution Detection	—Unverified
White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs	Apr 16, 2024	BenchmarkingLanguage Modelling	—Unverified
Who Said That? Benchmarking Social Media AI Detection	Oct 12, 2023	BenchmarkingMisinformation	—Unverified
Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice	Feb 29, 2020	BenchmarkingHoldout Set	—Unverified
Why every GBDT speed benchmark is wrong	Oct 24, 2018	Benchmarking	—Unverified
Why is the winner the best?	Mar 30, 2023	BenchmarkingMulti-Task Learning	—Unverified
WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution	Apr 28, 2025	BenchmarkingImage Attribution	—Unverified
Wildfire Forecasting with Satellite Images and Deep Generative Model	Aug 19, 2022	BenchmarkingVideo Prediction	—Unverified
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences	Jun 16, 2024	BenchmarkingSpatial Reasoning	—Unverified
Window-of-interest based Multi-objective Evolutionary Search for Satisficing Concepts	Jul 4, 2017	Benchmarking	—Unverified
WiSoSuper: Benchmarking Super-Resolution Methods on Wind and Solar Data	Sep 17, 2021	BenchmarkingBIG-bench Machine Learning	—Unverified
Word Complexity Estimation for Japanese Lexical Simplification	May 1, 2020	BenchmarkingLexical Simplification	—Unverified
WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models	May 14, 2025	Benchmarking	—Unverified
Writing as a testbed for open ended agents	Mar 25, 2025	BenchmarkingDiversity	—Unverified
xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods	Feb 5, 2025	Benchmarking	—Unverified
XCSP3: An Integrated Format for Benchmarking Combinatorial Constrained Problems	Nov 10, 2016	Benchmarking	—Unverified
XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis	Jun 26, 2024	Autonomous DrivingBenchmarking	—Unverified
Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval	May 28, 2025	BenchmarkingRecommendation Systems	—Unverified
Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support	Feb 15, 2025	BenchmarkingEpidemiology	—Unverified
Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer's Disease	Sep 21, 2017	BenchmarkingClassification	—Unverified
You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain	Jan 23, 2025	BenchmarkingDomain Adaptation	—Unverified
Zero-Forcing Max-Power Beamforming for Hybrid mmWave Full-Duplex MIMO Systems	Feb 29, 2020	Benchmarking	—Unverified

Show:10 25 50

← PrevPage 175 of 222Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified