SOTAVerified

Benchmarking

Papers

Showing 4351–4375 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | | 0 |
| Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | | 0 |
| Which models are innately best at uncertainty estimation? | | 0 |
| White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | | 0 |
| Who Said That? Benchmarking Social Media AI Detection | | 0 |
| Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice | | 0 |
| Why every GBDT speed benchmark is wrong | | 0 |
| Why is the winner the best? | | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | | 0 |
| Wildfire Forecasting with Satellite Images and Deep Generative Model | | 0 |
| WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | | 0 |
| Window-of-interest based Multi-objective Evolutionary Search for Satisficing Concepts | | 0 |
| WiSoSuper: Benchmarking Super-Resolution Methods on Wind and Solar Data | | 0 |
| Word Complexity Estimation for Japanese Lexical Simplification | | 0 |
| WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models | | 0 |
| Writing as a testbed for open ended agents | | 0 |
| xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | | 0 |
| XCSP3: An Integrated Format for Benchmarking Combinatorial Constrained Problems | | 0 |
| XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis | | 0 |
| Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval | | 0 |
| Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | | 0 |
| Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer's Disease | | 0 |
| You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | | 0 |
| Zero-Forcing Max-Power Beamforming for Hybrid mmWave Full-Duplex MIMO Systems | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |