| When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | Apr 2, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | May 22, 2025 | Benchmarking | —Unverified | 0 |
| Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | May 25, 2025 | Benchmarking, Multi-Agent Path Finding | —Unverified | 0 |
| Which models are innately best at uncertainty estimation? | Jun 5, 2022 | Benchmarking, Out-of-Distribution Detection | —Unverified | 0 |
| White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | Apr 16, 2024 | Benchmarking, Language Modelling | —Unverified | 0 |
| Who Said That? Benchmarking Social Media AI Detection | Oct 12, 2023 | Benchmarking, Misinformation | —Unverified | 0 |
| Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice | Feb 29, 2020 | Benchmarking, Holdout Set | —Unverified | 0 |
| Why every GBDT speed benchmark is wrong | Oct 24, 2018 | Benchmarking | —Unverified | 0 |
| Why is the winner the best? | Mar 30, 2023 | Benchmarking, Multi-Task Learning | —Unverified | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | Apr 28, 2025 | Benchmarking, Image Attribution | —Unverified | 0 |