| BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | Nov 20, 2024 | Benchmarking, NetHack | —Unverified | 0 | 0 |
| Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | Oct 22, 2024 | Attribute, Benchmarking | —Unverified | 0 | 0 |
| Balanced Random Survival Forests for Extremely Unbalanced, Right Censored Data | Mar 24, 2018 | Benchmarking, Prediction | —Unverified | 0 | 0 |
| A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness | May 5, 2023 | Benchmarking, Dataset Distillation | —Unverified | 0 | 0 |
| Portfolio Benchmarking under Drawdown Constraint and Stochastic Sharpe Ratio | Oct 26, 2016 | Benchmarking | —Unverified | 0 | 0 |
| PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions | Jun 20, 2024 | Animal Pose Estimation, Autonomous Driving | —Unverified | 0 | 0 |
| Pose Estimation for Non-Cooperative Spacecraft Rendezvous Using Convolutional Neural Networks | Sep 19, 2018 | Benchmarking, Image Generation | —Unverified | 0 | 0 |
| BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving | Mar 6, 2024 | Automated Theorem Proving, Benchmarking | —Unverified | 0 | 0 |
| Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation | May 1, 2025 | Benchmarking, Position | —Unverified | 0 | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | May 22, 2025 | Benchmarking, RAG | —Unverified | 0 | 0 |
| Position: Benchmarking is Limited in Reinforcement Learning Research | Jun 23, 2024 | Benchmarking, Position | —Unverified | 0 | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | Feb 20, 2025 | Benchmarking, Combinatorial Optimization | —Unverified | 0 | 0 |
| Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods | May 2, 2024 | Benchmarking | —Unverified | 0 | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | Feb 19, 2025 | Benchmarking, Position | —Unverified | 0 | 0 |
| Post-FEC BER Benchmarking for Bit-Interleaved Coded Modulation with Probabilistic Shaping | Apr 24, 2020 | Benchmarking | —Unverified | 0 | 0 |
| Post-hoc labeling of arbitrary EEG recordings for data-efficient evaluation of neural decoding methods | Nov 22, 2017 | Benchmarking, EEG | —Unverified | 0 | 0 |
| Deep Neural Operator Driven Real Time Inference for Nuclear Systems to Enable Digital Twin Solutions | Aug 15, 2023 | Benchmarking, Computational Efficiency | —Unverified | 0 | 0 |
| PowerGraph: A power grid benchmark dataset for graph neural networks | Feb 5, 2024 | Articles, Benchmarking | —Unverified | 0 | 0 |
| Power Line Communication vs. Talkative Power Conversion: A Benchmarking Study | Apr 16, 2025 | Benchmarking | —Unverified | 0 | 0 |
| AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs | Jun 5, 2025 | Benchmarking, Video Understanding | —Unverified | 0 | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | May 21, 2025 | Benchmarking, Imitation Learning | —Unverified | 0 | 0 |
| UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges | Oct 31, 2023 | Benchmarking | —Unverified | 0 | 0 |
| Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding | Jan 7, 2025 | Benchmarking, Code Generation | —Unverified | 0 | 0 |
| A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval | Nov 30, 2023 | Benchmarking, Retrieval | —Unverified | 0 | 0 |
| Practical, Fast and Robust Point Cloud Registration for 3D Scene Stitching and Object Localization | Nov 8, 2021 | 3D Feature Matching, Benchmarking | —Unverified | 0 | 0 |