SOTAVerified

Benchmarking

Papers

Showing 4351–4360 of 5548 papers

Title	Date	Tasks	Status	Hype
When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks	Apr 2, 2025	Benchmarking, Language Modeling	Unverified	0
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques	May 22, 2025	Benchmarking	Unverified	0
Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding	May 25, 2025	Benchmarking, Multi-Agent Path Finding	Unverified	0
Which models are innately best at uncertainty estimation?	Jun 5, 2022	Benchmarking, Out-of-Distribution Detection	Unverified	0
White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs	Apr 16, 2024	Benchmarking, Language Modelling	Unverified	0
Who Said That? Benchmarking Social Media AI Detection	Oct 12, 2023	Benchmarking, Misinformation	Unverified	0
Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice	Feb 29, 2020	Benchmarking, Holdout Set	Unverified	0
Why every GBDT speed benchmark is wrong	Oct 24, 2018	Benchmarking	Unverified	0
Why is the winner the best?	Mar 30, 2023	Benchmarking, Multi-Task Learning	Unverified	0
WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution	Apr 28, 2025	Benchmarking, Image Attribution	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified