| Visual Place Recognition for Large-Scale UAV Applications | Jul 20, 2025 | Benchmarking, Diversity | Unverified | 0 |
| MUPAX: Multidimensional Problem Agnostic eXplainable AI | Jul 17, 2025 | Anatomical Landmark Detection, Audio Classification | Unverified | 0 |
| Training Transformers with Enforced Lipschitz Constants | Jul 17, 2025 | Benchmarking | Unverified | 0 |
| Disentangling coincident cell events using deep transfer learning and compressive sensing | Jul 17, 2025 | Benchmarking, Compressive Sensing | Unverified | 0 |
| DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition | Jul 16, 2025 | Benchmarking, Knowledge Distillation | Code Available | 0 |
| DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Jul 15, 2025 | Benchmarking, Instruction Following | Code Available | 2 |
| FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning | Jul 15, 2025 | Benchmarking, Federated Learning | Code Available | 0 |
| A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion | Jul 15, 2025 | Benchmarking, Point Cloud Completion | Unverified | 0 |
| DCR: Quantifying Data Contamination in LLMs Evaluation | Jul 15, 2025 | Arithmetic Reasoning, Benchmarking | Code Available | 0 |
| CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance | Jul 14, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks | Jul 14, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop | Jul 14, 2025 | Benchmarking | Unverified | 0 |
| MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking | Jul 14, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models | Jul 13, 2025 | Attribute, Benchmarking | Code Available | 0 |
| Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift | Jul 12, 2025 | Benchmarking, Transfer Learning | Unverified | 0 |
| Identifying the Smallest Adversarial Load Perturbations that Render DC-OPF Infeasible | Jul 10, 2025 | Adversarial Attack, Benchmarking | Code Available | 0 |
| Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning | Jul 9, 2025 | Benchmarking, Image Retrieval | Code Available | 0 |
| Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset | Jul 9, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| A Systematic Analysis of Hybrid Linear Attention | Jul 8, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study | Jul 8, 2025 | Anomaly Detection, Benchmarking | Unverified | 0 |
| SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations | Jul 8, 2025 | 6D Pose Estimation, 6D Pose Estimation using RGB | Code Available | 0 |
| SQLBarber: A System Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads | Jul 8, 2025 | Benchmarking | Unverified | 0 |
| Inaugural MOASEI Competition at AAMAS'2025: A Technical Report | Jul 7, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models | Jul 5, 2025 | Benchmarking, GPU | Code Available | 1 |
| GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning | Jul 4, 2025 | Benchmarking, Graph Generation | Code Available | 2 |
| STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking | Jul 4, 2025 | Benchmarking, Navigate | Code Available | 0 |
| LANTERN: A Machine Learning Framework for Lipid Nanoparticle Transfection Efficiency Prediction | Jul 3, 2025 | Benchmarking | Code Available | 0 |
| Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data | Jul 3, 2025 | Benchmarking, Representation Learning | Code Available | 1 |
| CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks | Jul 3, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation | Jul 1, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| State and Memory is All You Need for Robust and Reliable AI Agents | Jun 30, 2025 | All, Benchmarking | Unverified | 0 |
| Point Cloud Compression and Objective Quality Assessment: A Survey | Jun 28, 2025 | Autonomous Driving, Benchmarking | Unverified | 0 |
| Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | Jun 26, 2025 | Benchmarking | Unverified | 0 |
| mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale | Jun 26, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation | Jun 26, 2025 | Attribute, Benchmarking | Unverified | 0 |
| Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation | Jun 26, 2025 | Benchmarking, Transfer Learning | Code Available | 0 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Jun 26, 2025 | Benchmarking, Drug Design | Code Available | 1 |
| scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection | Jun 25, 2025 | Benchmarking, Contrastive Learning | Unverified | 0 |
| MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans | Jun 25, 2025 | Action Detection, Benchmarking | Unverified | 0 |
| FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization | Jun 25, 2025 | Benchmarking, Contrastive Learning | Unverified | 0 |
| AI-Driven MRI-based Brain Tumour Segmentation Benchmarking | Jun 25, 2025 | Benchmarking, Image Segmentation | Unverified | 0 |
| inMOTIFin: a lightweight end-to-end simulation software for regulatory sequences | Jun 25, 2025 | Benchmarking | Code Available | 0 |
| HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | Jun 25, 2025 | Benchmarking, Person Identification | Code Available | 0 |
| Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision | Jun 25, 2025 | Benchmarking, Information Retrieval | Unverified | 0 |
| Benchmarking Unsupervised Strategies for Anomaly Detection in Multivariate Time Series | Jun 25, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| A Survey of Predictive Maintenance Methods: An Analysis of Prognostics via Classification and Regression | Jun 25, 2025 | Benchmarking, Management | Unverified | 0 |
| BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos | Jun 25, 2025 | Artifact Detection, Benchmarking | Unverified | 0 |
| WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads | Jun 25, 2025 | Benchmarking | Code Available | 1 |
| Quantitative Benchmarking of Anomaly Detection Methods in Digital Pathology | Jun 24, 2025 | Anomaly Detection, Artifact Detection | Unverified | 0 |
| MDR-DeePC: Model-Inspired Distributionally Robust Data-Enabled Predictive Control | Jun 24, 2025 | Benchmarking | Unverified | 0 |