| PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs | Jun 24, 2024 | Benchmarking, Machine Unlearning | —Unverified | 0 | 0 |
| Pitfalls of topology-aware image segmentation | Dec 19, 2024 | Benchmarking, Image Segmentation | —Unverified | 0 | 0 |
| pix2pockets: Shot Suggestions in 8-Ball Pool from a Single Image in the Wild | Apr 16, 2025 | Benchmarking, object-detection | —Unverified | 0 | 0 |
| A Computer Vision System to Localize and Classify Wastes on the Streets | Oct 31, 2017 | Benchmarking | —Unverified | 0 | 0 |
| Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities | May 16, 2025 | Benchmarking | —Unverified | 0 | 0 |
| A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects | Jun 16, 2025 | Benchmarking, Instance Segmentation | —Unverified | 0 | 0 |
| PKLot - A robust dataset for parking lot classification | Jul 1, 2015 | Benchmarking, Classification | —Unverified | 0 | 0 |
| PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI | May 19, 2025 | Benchmarking, Minecraft | —Unverified | 0 | 0 |
| BEADs: Bias Evaluation Across Domains | Jun 6, 2024 | Benchmarking, Bias Detection | —Unverified | 0 | 0 |
| BEACON: A Benchmark for Efficient and Accurate Counting of Subgraphs | Apr 15, 2025 | Benchmarking, Subgraph Counting | —Unverified | 0 | 0 |
| Plant in Cupboard, Orange on Table, Book on Shelf. Benchmarking Practical Reasoning and Situation Modelling in a Text-Simulated Situated Environment | Feb 17, 2025 | Benchmarking, Common Sense Reasoning | —Unverified | 0 | 0 |
| BBOB Instance Analysis: Landscape Properties and Algorithm Performance across Problem Instances | Nov 29, 2022 | Benchmarking | —Unverified | 0 | 0 |
| Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study | May 23, 2020 | Benchmarking, Network Pruning | —Unverified | 0 | 0 |
| Bayesian Multi-type Mean Field Multi-agent Imitation Learning | Dec 1, 2020 | Benchmarking, Imitation Learning | —Unverified | 0 | 0 |
| White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs | Apr 16, 2024 | Benchmarking, Language Modelling | —Unverified | 0 | 0 |
| A Bayesian Model for Bivariate Causal Inference | Dec 24, 2018 | Benchmarking, Causal Inference | —Unverified | 0 | 0 |
| A Comprehensive Study on the Robustness of Image Classification and Object Detection in Remote Sensing: Surveying and Benchmarking | Jun 21, 2023 | Adversarial Robustness, Benchmarking | —Unverified | 0 | 0 |
| A Comprehensive Study on Robustness of Image Classification Models: Benchmarking and Rethinking | Feb 28, 2023 | Adversarial Robustness, Benchmarking | —Unverified | 0 | 0 |
| Barkour: Benchmarking Animal-level Agility with Quadruped Robots | May 24, 2023 | Benchmarking, Navigate | —Unverified | 0 | 0 |
| BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali | Oct 16, 2023 | Benchmarking, Data Augmentation | —Unverified | 0 | 0 |
| Point Cloud Compression and Objective Quality Assessment: A Survey | Jun 28, 2025 | Autonomous Driving, Benchmarking | —Unverified | 0 | 0 |
| Point Cloud Objective Quality: Benchmarking Features and Quality Evaluation | Apr 4, 2025 | Attribute, Benchmarking | —Unverified | 0 | 0 |
| Polarization and Index Modulations: a Theoretical and Practical Perspective | Mar 20, 2018 | Benchmarking, Navigate | —Unverified | 0 | 0 |
| Policy Entropy for Out-of-Distribution Classification | May 25, 2020 | Benchmarking, Classification | —Unverified | 0 | 0 |
| U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | May 23, 2025 | Benchmarking, Spatial Reasoning | —Unverified | 0 | 0 |
| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | Nov 20, 2024 | Benchmarking, NetHack | —Unverified | 0 | 0 |
| Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | Oct 22, 2024 | Attribute, Benchmarking | —Unverified | 0 | 0 |
| Balanced Random Survival Forests for Extremely Unbalanced, Right Censored Data | Mar 24, 2018 | Benchmarking, Prediction | —Unverified | 0 | 0 |
| A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness | May 5, 2023 | Benchmarking, Dataset Distillation | —Unverified | 0 | 0 |
| Portfolio Benchmarking under Drawdown Constraint and Stochastic Sharpe Ratio | Oct 26, 2016 | Benchmarking | —Unverified | 0 | 0 |
| PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | Jun 20, 2024 | Animal Pose Estimation, Autonomous Driving | —Unverified | 0 | 0 |
| Pose Estimation for Non-Cooperative Spacecraft Rendezvous Using Convolutional Neural Networks | Sep 19, 2018 | Benchmarking, Image Generation | —Unverified | 0 | 0 |
| BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving | Mar 6, 2024 | Automated Theorem Proving, Benchmarking | —Unverified | 0 | 0 |
| Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | May 1, 2025 | Benchmarking, Position | —Unverified | 0 | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | May 22, 2025 | Benchmarking, RAG | —Unverified | 0 | 0 |
| Position: Benchmarking is Limited in Reinforcement Learning Research | Jun 23, 2024 | Benchmarking, Position | —Unverified | 0 | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | Feb 20, 2025 | Benchmarking, Combinatorial Optimization | —Unverified | 0 | 0 |
| Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods | May 2, 2024 | Benchmarking | —Unverified | 0 | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | Feb 19, 2025 | Benchmarking, Position | —Unverified | 0 | 0 |
| Post-FEC BER Benchmarking for Bit-Interleaved Coded Modulation with Probabilistic Shaping | Apr 24, 2020 | Benchmarking | —Unverified | 0 | 0 |
| Post-hoc labeling of arbitrary EEG recordings for data-efficient evaluation of neural decoding methods | Nov 22, 2017 | Benchmarking, EEG | —Unverified | 0 | 0 |
| Deep Neural Operator Driven Real Time Inference for Nuclear Systems to Enable Digital Twin Solutions | Aug 15, 2023 | Benchmarking, Computational Efficiency | —Unverified | 0 | 0 |
| PowerGraph: A power grid benchmark dataset for graph neural networks | Feb 5, 2024 | Articles, Benchmarking | —Unverified | 0 | 0 |
| Power Line Communication vs. Talkative Power Conversion: A Benchmarking Study | Apr 16, 2025 | Benchmarking | —Unverified | 0 | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Jun 5, 2025 | Benchmarking, Video Understanding | —Unverified | 0 | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | May 21, 2025 | Benchmarking, Imitation Learning | —Unverified | 0 | 0 |
| UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges | Oct 31, 2023 | Benchmarking | —Unverified | 0 | 0 |
| Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding | Jan 7, 2025 | Benchmarking, Code Generation | —Unverified | 0 | 0 |
| A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval | Nov 30, 2023 | Benchmarking, Retrieval | —Unverified | 0 | 0 |
| Practical, Fast and Robust Point Cloud Registration for 3D Scene Stitching and Object Localization | Nov 8, 2021 | 3D Feature Matching, Benchmarking | —Unverified | 0 | 0 |