| CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks | Jul 14, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop | Jul 14, 2025 | Benchmarking | Unverified | 0 |
| MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking | Jul 14, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models | Jul 13, 2025 | Attribute, Benchmarking | Code Available | 0 |
| Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift | Jul 12, 2025 | Benchmarking, Transfer Learning | Unverified | 0 |
| Identifying the Smallest Adversarial Load Perturbations that Render DC-OPF Infeasible | Jul 10, 2025 | Adversarial Attack, Benchmarking | Code Available | 0 |
| Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning | Jul 9, 2025 | Benchmarking, Image Retrieval | Code Available | 0 |
| Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset | Jul 9, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| A Systematic Analysis of Hybrid Linear Attention | Jul 8, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations | Jul 8, 2025 | 6D Pose Estimation, 6D Pose Estimation using RGB | Code Available | 0 |