SOTAVerified

Benchmarking

Papers

Showing 4351–4360 of 5548 papers

Title | Status | Hype
When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | | 0
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | | 0
Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | | 0
Which models are innately best at uncertainty estimation? | | 0
White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | | 0
Who Said That? Benchmarking Social Media AI Detection | | 0
Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice | | 0
Why every GBDT speed benchmark is wrong | | 0
Why is the winner the best? | | 0
WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified