| When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks | Apr 2, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | May 22, 2025 | Benchmarking | —Unverified | 0 |
| Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | May 25, 2025 | Benchmarking, Multi-Agent Path Finding | —Unverified | 0 |
| Which models are innately best at uncertainty estimation? | Jun 5, 2022 | Benchmarking, Out-of-Distribution Detection | —Unverified | 0 |
| White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | Apr 16, 2024 | Benchmarking, Language Modelling | —Unverified | 0 |
| Who Said That? Benchmarking Social Media AI Detection | Oct 12, 2023 | Benchmarking, Misinformation | —Unverified | 0 |
| Who Wins the Game of Thrones? How Sentiments Improve the Prediction of Candidate Choice | Feb 29, 2020 | Benchmarking, Holdout Set | —Unverified | 0 |
| Why every GBDT speed benchmark is wrong | Oct 24, 2018 | Benchmarking | —Unverified | 0 |
| Why is the winner the best? | Mar 30, 2023 | Benchmarking, Multi-Task Learning | —Unverified | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | Apr 28, 2025 | Benchmarking, Image Attribution | —Unverified | 0 |
| Wildfire Forecasting with Satellite Images and Deep Generative Model | Aug 19, 2022 | Benchmarking, Video Prediction | —Unverified | 0 |
| WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences | Jun 16, 2024 | Benchmarking, Spatial Reasoning | —Unverified | 0 |
| Window-of-interest based Multi-objective Evolutionary Search for Satisficing Concepts | Jul 4, 2017 | Benchmarking | —Unverified | 0 |
| WiSoSuper: Benchmarking Super-Resolution Methods on Wind and Solar Data | Sep 17, 2021 | Benchmarking, BIG-bench Machine Learning | —Unverified | 0 |
| Word Complexity Estimation for Japanese Lexical Simplification | May 1, 2020 | Benchmarking, Lexical Simplification | —Unverified | 0 |
| WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models | May 14, 2025 | Benchmarking | —Unverified | 0 |
| Writing as a testbed for open ended agents | Mar 25, 2025 | Benchmarking, Diversity | —Unverified | 0 |
| xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | Feb 5, 2025 | Benchmarking | —Unverified | 0 |
| XCSP3: An Integrated Format for Benchmarking Combinatorial Constrained Problems | Nov 10, 2016 | Benchmarking | —Unverified | 0 |
| XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis | Jun 26, 2024 | Autonomous Driving, Benchmarking | —Unverified | 0 |
| Yambda-5B -- A Large-Scale Multi-modal Dataset for Ranking And Retrieval | May 28, 2025 | Benchmarking, Recommendation Systems | —Unverified | 0 |
| Yesil o1 Pro: Evidence-Based AI Model for Health and Benchmarking in Clinical Decision Support | Feb 15, 2025 | Benchmarking, Epidemiology | —Unverified | 0 |
| Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer's Disease | Sep 21, 2017 | Benchmarking, Classification | —Unverified | 0 |
| You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | Jan 23, 2025 | Benchmarking, Domain Adaptation | —Unverified | 0 |
| Zero-Forcing Max-Power Beamforming for Hybrid mmWave Full-Duplex MIMO Systems | Feb 29, 2020 | Benchmarking | —Unverified | 0 |
| Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models | Apr 1, 2025 | Benchmarking | —Unverified | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | Aug 27, 2024 | Benchmarking, Large Language Model | —Unverified | 0 |
| λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics | Nov 28, 2024 | Benchmarking, Diversity | —Unverified | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | Oct 18, 2024 | Benchmarking, Fairness | —Unverified | 0 |
| LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | Mar 14, 2025 | Benchmarking, MMLU | —Unverified | 0 |
| LAMBDA: Covering the Solution Set of Black-Box Inequality by Search Space Quantization | Mar 25, 2022 | Benchmarking, Quantization | —Unverified | 0 |
| Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | Sep 2, 2024 | Benchmarking | —Unverified | 0 |
| LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | Jun 3, 2024 | Autonomous Driving, Benchmarking | —Unverified | 0 |
| Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance | Feb 17, 2025 | Benchmarking, Dependency Parsing | —Unverified | 0 |
| Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance | Jul 18, 2024 | Benchmarking | —Unverified | 0 |
| Language Models for Automated Classification of Brain MRI Reports and Growth Chart Generation | Mar 15, 2025 | Benchmarking | —Unverified | 0 |
| Can LLMs Capture Human Preferences? | May 4, 2023 | Benchmarking | —Unverified | 0 |
| Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning | Oct 3, 2024 | Benchmarking, Language Modeling | —Unverified | 0 |
| Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices | Oct 4, 2024 | Benchmarking, Language Modeling | —Unverified | 0 |
| Large Language Models are Null-Shot Learners | Jan 16, 2024 | Arithmetic Reasoning, Benchmarking | —Unverified | 0 |
| Large Language Models are Few-Shot Clinical Information Extractors | May 25, 2022 | Benchmarking, coreference-resolution | —Unverified | 0 |
| Large Language Models as Automated Aligners for benchmarking Vision-Language Models | Nov 24, 2023 | Benchmarking, World Knowledge | —Unverified | 0 |
| Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens | Jun 10, 2025 | Benchmarking, Mathematical Reasoning | —Unverified | 0 |
| Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level | Nov 5, 2024 | Bayesian Optimisation, Benchmarking | —Unverified | 0 |
| Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding | Jan 24, 2024 | Benchmarking, Language Modeling | —Unverified | 0 |
| Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models | Jan 9, 2025 | Benchmarking, Philosophical Reflection | —Unverified | 0 |
| Large-scale Benchmarking of Metaphor-based Optimization Heuristics | Feb 15, 2024 | Benchmarking, Experimental Design | —Unverified | 0 |
| Large-Scale Quantum Separability Through a Reproducible Machine Learning Lens | Jun 15, 2023 | Benchmarking | —Unverified | 0 |
| Latency-aware Road Anomaly Segmentation in Videos: A Photorealistic Dataset and New Metrics | Jan 10, 2024 | Anomaly Segmentation, Autonomous Driving | —Unverified | 0 |
| Latent Variable Models for Visual Question Answering | Jan 16, 2021 | Benchmarking, Question Answering | —Unverified | 0 |