| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking, Image Captioning | Code Available | 1 |
| Writing as a testbed for open ended agents | Mar 25, 2025 | Benchmarking, Diversity | Unverified | 0 |
| The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Mar 25, 2025 | Benchmarking, Scene Segmentation | Code Available | 1 |
| Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Mar 24, 2025 | Benchmarking, Humanitarian | Code Available | 1 |
| Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling | Mar 24, 2025 | Benchmarking, OpenAI Gym | Code Available | 0 |
| LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages | Mar 24, 2025 | Benchmarking | Code Available | 0 |
| Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Mar 24, 2025 | Benchmarking, Semantic Segmentation | Code Available | 1 |
| Benchmarking Post-Hoc Unknown-Category Detection in Food Recognition | Mar 24, 2025 | Benchmarking, Food Recognition | Unverified | 0 |
| Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages | Mar 24, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis | Mar 24, 2025 | Benchmarking, Image Reconstruction | Unverified | 0 |
| EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation | Mar 24, 2025 | Benchmarking, Data Augmentation | Unverified | 0 |
| SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining | Mar 23, 2025 | 3DGS, Benchmarking | Code Available | 3 |
| GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Mar 23, 2025 | Benchmarking, Hallucination | Code Available | 1 |
| Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering | Mar 23, 2025 | Benchmarking, Chart Question Answering | Unverified | 0 |
| A Study on Neuro-Symbolic Artificial Intelligence: Healthcare Perspectives | Mar 23, 2025 | Benchmarking, Common Sense Reasoning | Unverified | 0 |
| Regularization of ML models for Earth systems by using longer model timesteps | Mar 23, 2025 | Benchmarking | Unverified | 0 |
| Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning | Mar 23, 2025 | Benchmarking | Code Available | 0 |
| CardioTabNet: A Novel Hybrid Transformer Model for Heart Disease Prediction using Tabular Medical Data | Mar 22, 2025 | Benchmarking, Disease Prediction | Unverified | 0 |
| 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding | Mar 22, 2025 | Benchmarking, Object | Code Available | 0 |
| V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction | Mar 22, 2025 | Benchmarking, Video Understanding | Code Available | 1 |
| IceBench: A Benchmark for Deep Learning based Sea Ice Type Classification | Mar 22, 2025 | Benchmarking, Classification | Code Available | 0 |
| Benchmark Dataset for Pore-Scale CO2-Water Interaction | Mar 22, 2025 | Benchmarking | Unverified | 0 |
| Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Mar 21, 2025 | Benchmarking, Video Generation | Code Available | 2 |
| CausalRivers -- Scaling up benchmarking of causal discovery for real-world time-series | Mar 21, 2025 | Anomaly Detection, Benchmarking | Unverified | 0 |
| ContextGNN goes to Elliot: Towards Benchmarking Relational Deep Learning for Static Link Prediction (aka Personalized Item Recommendation) | Mar 20, 2025 | Benchmarking, Link Prediction | Code Available | 0 |
| QCPINN: Quantum-Classical Physics-Informed Neural Networks for Solving PDEs | Mar 20, 2025 | Benchmarking, Physics-informed machine learning | Code Available | 1 |
| A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough? | Mar 20, 2025 | Benchmarking | Unverified | 0 |
| Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models | Mar 20, 2025 | Benchmarking, Reinforcement Learning (RL) | Code Available | 4 |
| ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph | Mar 20, 2025 | Benchmarking, Hallucination | Unverified | 0 |
| The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination | Mar 20, 2025 | Benchmarking, Large Language Model | Code Available | 1 |
| DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs | Mar 20, 2025 | Benchmarking, Hallucination | Unverified | 0 |
| Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI | Mar 20, 2025 | Benchmarking, Fairness | Unverified | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | Mar 19, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks | Mar 19, 2025 | Benchmarking, Domain Adaptation | Unverified | 0 |
| Language-based Image Colorization: A Benchmark and Beyond | Mar 19, 2025 | Benchmarking, Colorization | Code Available | 0 |
| Kolmogorov-Arnold Network for Transistor Compact Modeling | Mar 19, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Large Language Models for Handwritten Text Recognition | Mar 19, 2025 | Benchmarking, Handwritten Text Recognition | Unverified | 0 |
| VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning | Mar 19, 2025 | Benchmarking, Language Modeling | Code Available | 2 |
| SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes | Mar 19, 2025 | 3D Semantic Segmentation, Benchmarking | Unverified | 0 |
| ImputeGAP: A Comprehensive Library for Time Series Imputation | Mar 19, 2025 | Benchmarking, Imputation | Unverified | 0 |
| Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack | Mar 18, 2025 | 8k, Benchmarking | Unverified | 0 |
| COPA: Comparing the Incomparable to Explore the Pareto Front | Mar 18, 2025 | AutoML, Benchmarking | Unverified | 0 |
| ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models | Mar 18, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System | Mar 18, 2025 | Benchmarking, In-Context Learning | Code Available | 1 |
| Benchmarking Failures in Tool-Augmented Language Models | Mar 18, 2025 | Benchmarking, Text Generation | Code Available | 0 |
| HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an Open Leaderboard | Mar 18, 2025 | Benchmarking, Human Dynamics | Unverified | 0 |
| Stable Virtual Camera: Generative View Synthesis with Diffusion Models | Mar 18, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis | Mar 18, 2025 | Benchmarking, Drug Response Prediction | Code Available | 0 |
| Organ-aware Multi-scale Medical Image Segmentation Using Text Prompt Engineering | Mar 18, 2025 | Benchmarking, Descriptive | Unverified | 0 |
| CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models | Mar 18, 2025 | Benchmarking, Spatial Reasoning | Code Available | 0 |