| When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | Apr 2, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | May 22, 2025 | Benchmarking | —Unverified | 0 |
| Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | May 25, 2025 | Benchmarking, Multi-Agent Path Finding | —Unverified | 0 |
| Which models are innately best at uncertainty estimation? | Jun 5, 2022 | Benchmarking, Out-of-Distribution Detection | —Unverified | 0 |
| White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | Apr 16, 2024 | Benchmarking, Language Modelling | —Unverified | 0 |
| Who Said That? Benchmarking Social Media AI Detection | Oct 12, 2023 | Benchmarking, Misinformation | —Unverified | 0 |
| Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice | Feb 29, 2020 | Benchmarking, Holdout Set | —Unverified | 0 |
| Why every GBDT speed benchmark is wrong | Oct 24, 2018 | Benchmarking | —Unverified | 0 |
| Why is the winner the best? | Mar 30, 2023 | Benchmarking, Multi-Task Learning | —Unverified | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | Apr 28, 2025 | Benchmarking, Image Attribution | —Unverified | 0 |
| Wildfire Forecasting with Satellite Images and Deep Generative Model | Aug 19, 2022 | Benchmarking, Video Prediction | —Unverified | 0 |
| WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | Jun 16, 2024 | Benchmarking, Spatial Reasoning | —Unverified | 0 |
| Window-of-interest based Multi-objective Evolutionary Search for Satisficing Concepts | Jul 4, 2017 | Benchmarking | —Unverified | 0 |
| WiSoSuper: Benchmarking Super-Resolution Methods on Wind and Solar Data | Sep 17, 2021 | Benchmarking, BIG-bench Machine Learning | —Unverified | 0 |
| Word Complexity Estimation for Japanese Lexical Simplification | May 1, 2020 | Benchmarking, Lexical Simplification | —Unverified | 0 |
| WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models | May 14, 2025 | Benchmarking | —Unverified | 0 |
| Writing as a testbed for open ended agents | Mar 25, 2025 | Benchmarking, Diversity | —Unverified | 0 |
| xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | Feb 5, 2025 | Benchmarking | —Unverified | 0 |
| XCSP3: An Integrated Format for Benchmarking Combinatorial Constrained Problems | Nov 10, 2016 | Benchmarking | —Unverified | 0 |
| XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis | Jun 26, 2024 | Autonomous Driving, Benchmarking | —Unverified | 0 |
| Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval | May 28, 2025 | Benchmarking, Recommendation Systems | —Unverified | 0 |
| Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | Feb 15, 2025 | Benchmarking, Epidemiology | —Unverified | 0 |
| Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer's Disease | Sep 21, 2017 | Benchmarking, Classification | —Unverified | 0 |
| You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | Jan 23, 2025 | Benchmarking, Domain Adaptation | —Unverified | 0 |
| Zero-Forcing Max-Power Beamforming for Hybrid mmWave Full-Duplex MIMO Systems | Feb 29, 2020 | Benchmarking | —Unverified | 0 |