| Benchmarking Online Object Trackers for Underwater Robot Position Locking Applications | Feb 23, 2025 | Benchmarking, Object Tracking | Unverified | 0 |
| VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs | Feb 23, 2025 | Benchmarking | Unverified | 0 |
| VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Feb 23, 2025 | Benchmarking, Spatial Reasoning | Code Available | 0 |
| Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation | Feb 21, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | Feb 21, 2025 | Benchmarking, DeepFake Detection | Unverified | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models | Feb 21, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis | Feb 21, 2025 | 3DGS, Autonomous Driving | Unverified | 0 |
| Probabilistic Robustness in Deep Learning: A Concise yet Comprehensive Guide | Feb 20, 2025 | Adversarial Robustness, Benchmarking | Unverified | 0 |
| Synthetic Porous Microstructures: Automatic Design, Simulation, and Permeability Analysis | Feb 20, 2025 | Benchmarking | Code Available | 0 |
| Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models | Feb 20, 2025 | Benchmarking | Unverified | 0 |
| Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models | Feb 20, 2025 | Benchmarking, Sentence | Unverified | 0 |
| Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems | Feb 20, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| PredictaBoard: Benchmarking LLM Score Predictability | Feb 20, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Reinforcement Learning with Graph Attention for Routing and Wavelength Assignment with Lightpath Reuse | Feb 20, 2025 | Benchmarking, Graph Attention | Unverified | 0 |
| Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework | Feb 20, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks | Feb 20, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk | Feb 20, 2025 | Benchmarking | Unverified | 0 |
| GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | Feb 19, 2025 | Benchmarking | Unverified | 0 |
| A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior | Feb 19, 2025 | Benchmarking, Misinformation | Unverified | 0 |
| Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction | Feb 19, 2025 | Benchmarking, MRI Reconstruction | Code Available | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | Feb 19, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Position: There are no Champions in Long-Term Time Series Forecasting | Feb 19, 2025 | Benchmarking, Position | Unverified | 0 |
| Benchmarking of Different YOLO Models for CAPTCHAs Detection and Classification | Feb 19, 2025 | Benchmarking | Unverified | 0 |
| EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking | Feb 18, 2025 | Benchmarking, Binary Classification | Unverified | 0 |