| PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning | Jun 24, 2025 | Benchmarking, Drug Discovery | Code Available | 2 |
| QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges | Jun 24, 2025 | Benchmarking, Code Generation | —Unverified | 0 |
| Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping | Jun 23, 2025 | Benchmarking, Diversity | Code Available | 0 |
| Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey | Jun 23, 2025 | Benchmarking, Survey | —Unverified | 0 |
| Simulation-Based Sensitivity Analysis in Optimal Treatment Regimes and Causal Decomposition with Individualized Interventions | Jun 23, 2025 | Benchmarking, Sensitivity | —Unverified | 0 |
| Staining normalization in histopathology: Method benchmarking using multicenter dataset | Jun 23, 2025 | Benchmarking | —Unverified | 0 |
| Survey of HPC in US Research Institutions | Jun 23, 2025 | Benchmarking, GPU | —Unverified | 0 |
| Benchmarking Music Generation Models and Metrics via Human Preference Studies | Jun 23, 2025 | Benchmarking, Music Generation | —Unverified | 0 |
| Identifiable Convex-Concave Regression via Sub-gradient Regularised Least Squares | Jun 22, 2025 | Benchmarking, regression | —Unverified | 0 |
| Statistical Multicriteria Evaluation of LLM-Generated Text | Jun 22, 2025 | Benchmarking, Diversity | Code Available | 0 |
| On the Robustness of Human-Object Interaction Detection against Distribution Shift | Jun 22, 2025 | Benchmarking, Data Augmentation | —Unverified | 0 |
| TAB: Unified Benchmarking of Time Series Anomaly Detection Methods | Jun 22, 2025 | Anomaly Detection, Benchmarking | Code Available | 2 |
| ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Jun 21, 2025 | Benchmarking, CPU | Code Available | 1 |
| Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking | Jun 21, 2025 | Benchmarking, Reinforcement Learning (RL) | —Unverified | 0 |
| A Comparative Analysis of Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) as Dimensionality Reduction Techniques | Jun 20, 2025 | Benchmarking, Dimensionality Reduction | —Unverified | 0 |
| Universal Music Representations? Evaluating Foundation Models on World Music Corpora | Jun 20, 2025 | Benchmarking, Few-Shot Learning | Code Available | 0 |
| TabArena: A Living Benchmark for Machine Learning on Tabular Data | Jun 20, 2025 | Benchmarking | Code Available | 3 |
| Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors | Jun 19, 2025 | Benchmarking, Face Swapping | —Unverified | 0 |
| InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Jun 19, 2025 | Benchmarking, Descriptive | Code Available | 1 |
| OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents | Jun 19, 2025 | Benchmarking | —Unverified | 0 |
| Finance Language Model Evaluation (FLaME) | Jun 18, 2025 | Benchmarking, Language Model Evaluation | —Unverified | 0 |
| BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models | Jun 17, 2025 | Benchmarking, Language Modeling | Code Available | 2 |
| Q2SAR: A Quantum Multiple Kernel Learning Approach for Drug Discovery | Jun 17, 2025 | Benchmarking, Drug Discovery | —Unverified | 0 |
| PGLib-CO2: A Power Grid Library for Computing and Optimizing Carbon Emissions | Jun 17, 2025 | Benchmarking | —Unverified | 0 |
| A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning | Jun 17, 2025 | Benchmarking, Self-Supervised Learning | —Unverified | 0 |
| GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies | Jun 17, 2025 | Benchmarking | Code Available | 1 |
| ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge | Jun 17, 2025 | Benchmarking, Retrieval | Code Available | 0 |
| Egocentric Human-Object Interaction Detection: A New Benchmark and Method | Jun 17, 2025 | Benchmarking, Human-Object Interaction Detection | —Unverified | 0 |
| The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products | Jun 16, 2025 | Benchmarking | Code Available | 1 |
| C-TLSAN: Content-Enhanced Time-Aware Long- and Short-Term Attention Network for Personalized Recommendation | Jun 16, 2025 | Benchmarking, Recommendation Systems | Code Available | 0 |
| A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects | Jun 16, 2025 | Benchmarking, Instance Segmentation | —Unverified | 0 |
| Deep Diffusion Models and Unsupervised Hyperspectral Unmixing for Realistic Abundance Map Synthesis | Jun 16, 2025 | Benchmarking, Data Augmentation | —Unverified | 0 |
| Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring | Jun 16, 2025 | Benchmarking, Few-Shot Learning | —Unverified | 0 |
| Robustness of Reinforcement Learning-Based Traffic Signal Control under Incidents: A Comparative Study | Jun 16, 2025 | Benchmarking, Traffic Signal Control | —Unverified | 0 |
| JENGA: Object selection and pose estimation for robotic grasping from a stack | Jun 16, 2025 | Benchmarking, Object | —Unverified | 0 |
| ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Jun 15, 2025 | Benchmarking | Code Available | 1 |
| A large-scale, physically-based synthetic dataset for satellite pose estimation | Jun 15, 2025 | Benchmarking, Dataset Generation | —Unverified | 0 |
| MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios | Jun 15, 2025 | Benchmarking | Code Available | 0 |
| OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics | Jun 14, 2025 | Benchmarking | Code Available | 4 |
| Delving into Instance-Dependent Label Noise in Graph Data: A Comprehensive Study and Benchmark | Jun 14, 2025 | Benchmarking, Graph Learning | Code Available | 0 |
| ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications | Jun 14, 2025 | Benchmarking | Code Available | 3 |
| Learning Best Paths in Quantum Networks | Jun 14, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables | Jun 13, 2025 | Benchmarking, Descriptive | —Unverified | 0 |
| SemanticST: Spatially Informed Semantic Graph Learning for Clustering, Integration, and Scalable Analysis of Spatial Transcriptomics | Jun 13, 2025 | Benchmarking, Contrastive Learning | —Unverified | 0 |
| Temporal cross-validation impacts multivariate time series subsequence anomaly detection evaluation | Jun 13, 2025 | Anomaly Detection, Benchmarking | —Unverified | 0 |
| crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023 | Jun 13, 2025 | Benchmarking, Domain Adaptation | —Unverified | 0 |
| EconGym: A Scalable AI Testbed with Diverse Economic Tasks | Jun 13, 2025 | Benchmarking | —Unverified | 0 |
| Mind the XAI Gap: A Human-Centered LLM Framework for Democratizing Explainable AI | Jun 13, 2025 | Benchmarking, In-Context Learning | Code Available | 0 |
| SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks | Jun 13, 2025 | Benchmarking, Large Language Model | Code Available | 2 |
| HyBiomass: Global Hyperspectral Imagery Benchmark Dataset for Evaluating Geospatial Foundation Models in Forest Aboveground Biomass Estimation | Jun 12, 2025 | Benchmarking | —Unverified | 0 |