| VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination | Mar 17, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Advancing Human-Machine Teaming: Concepts, Challenges, and Applications | Mar 16, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era | Mar 16, 2025 | Benchmarking, Image Captioning | Unverified | 0 |
| Dataset Properties Shape the Success of Neuroimaging-Based Patient Stratification: A Benchmarking Analysis Across Clustering Algorithms | Mar 15, 2025 | Benchmarking, Brain Morphometry | Unverified | 0 |
| Language Models for Automated Classification of Brain MRI Reports and Growth Chart Generation | Mar 15, 2025 | Benchmarking | Unverified | 0 |
| Genicious: Contextual Few-shot Prompting for Insights Discovery | Mar 15, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Dynamic Obstacle Avoidance with Bounded Rationality Adversarial Reinforcement Learning | Mar 14, 2025 | Benchmarking, Navigate | Unverified | 0 |
| InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences | Mar 14, 2025 | Benchmarking, Image Restoration | Unverified | 0 |
| RESPONSE: Benchmarking the Ability of Language Models to Undertake Commonsense Reasoning in Crisis Situation | Mar 14, 2025 | Benchmarking | Unverified | 0 |
| Challenges and Advancements in Modeling Shock Fronts with Physics-Informed Neural Networks: A Review and Benchmarking Study | Mar 14, 2025 | Benchmarking | Unverified | 0 |
| VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity | Mar 14, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | Mar 14, 2025 | Benchmarking, MMLU | Unverified | 0 |
| A Benchmarking Study of Vision-based Robotic Grasping Algorithms | Mar 14, 2025 | Benchmarking, Robotic Grasping | Code Available | 0 |
| V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning | Mar 14, 2025 | Benchmarking, Relational Reasoning | Unverified | 0 |
| Enhancing Hand Palm Motion Gesture Recognition by Eliminating Reference Frame Bias via Frame-Invariant Similarity Measures | Mar 14, 2025 | Benchmarking, Gesture Recognition | Unverified | 0 |
| Heterogeneous graph neural networks for species distribution modeling | Mar 14, 2025 | Benchmarking | Unverified | 0 |
| DarkBench: Benchmarking Dark Patterns in Large Language Models | Mar 13, 2025 | Benchmarking | Unverified | 0 |
| ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content | Mar 13, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs | Mar 13, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| MarineGym: A High-Performance Reinforcement Learning Platform for Underwater Robotics | Mar 12, 2025 | Benchmarking, GPU | Unverified | 0 |
| CULEMO: Cultural Lenses on Emotion -- Benchmarking LLMs for Cross-Cultural Emotion Understanding | Mar 12, 2025 | Benchmarking, Emotion Recognition | Unverified | 0 |
| SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models | Mar 12, 2025 | Benchmarking, Fairness | Unverified | 0 |
| Integration of nested cross-validation, automated hyperparameter optimization, high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models | Mar 11, 2025 | Benchmarking, Hyperparameter Optimization | Code Available | 0 |
| Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking | Mar 11, 2025 | Benchmarking | Unverified | 0 |
| Comprehensive Benchmarking of Machine Learning Methods for Risk Prediction Modelling from Large-Scale Survival Data: A UK Biobank Study | Mar 11, 2025 | Benchmarking | Unverified | 0 |
| Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges | Mar 11, 2025 | Benchmarking | Code Available | 0 |
| ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness | Mar 11, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies | Mar 10, 2025 | Benchmarking, Ethics | Unverified | 0 |
| Skelite: Compact Neural Networks for Efficient Iterative Skeletonization | Mar 10, 2025 | Benchmarking, Computational Efficiency | Code Available | 0 |
| Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models | Mar 10, 2025 | All, Benchmarking | Unverified | 0 |
| Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models | Mar 9, 2025 | Benchmarking | Unverified | 0 |
| General Scales Unlock AI Evaluation with Explanatory and Predictive Power | Mar 9, 2025 | Benchmarking, Specificity | Unverified | 0 |
| Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems | Mar 9, 2025 | Benchmarking | Unverified | 0 |
| Steerable Pyramid Weighted Loss: Multi-Scale Adaptive Weighting for Semantic Segmentation | Mar 9, 2025 | Autonomous Driving, Benchmarking | Unverified | 0 |
| DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning | Mar 9, 2025 | Benchmarking, Decision Making | Code Available | 0 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| Removing Multiple Hybrid Adverse Weather in Video via a Unified Model | Mar 8, 2025 | Benchmarking, Video Restoration | Unverified | 0 |
| UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces | Mar 8, 2025 | Benchmarking, counterfactual | Unverified | 0 |
| Understanding the Limits of Lifelong Knowledge Editing in LLMs | Mar 7, 2025 | Benchmarking, knowledge editing | Unverified | 0 |
| Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol | Mar 7, 2025 | Benchmarking, Bug fixing | Unverified | 0 |
| FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | Mar 7, 2025 | Articles, Benchmarking | Unverified | 0 |
| Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders | Mar 7, 2025 | Benchmarking, Click-Through Rate Prediction | Unverified | 0 |
| Removing Geometric Bias in One-Class Anomaly Detection with Adaptive Feature Perturbation | Mar 7, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination | Mar 6, 2025 | Benchmarking | Unverified | 0 |
| InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference | Mar 6, 2025 | Benchmarking | Unverified | 0 |
| LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Mar 6, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases | Mar 6, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models | Mar 6, 2025 | Benchmarking, Continual Learning | Code Available | 0 |
| ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions | Mar 6, 2025 | Benchmarking, HumanEval | Code Available | 0 |
| Benchmarking Reasoning Robustness in Large Language Models | Mar 6, 2025 | Benchmarking, Math | Unverified | 0 |