| Benchmarking MOEAs for solving continuous multi-objective RL problems | May 19, 2025 | Benchmarking, Evolutionary Algorithms | Code Available | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | May 19, 2025 | Benchmarking, Legal Reasoning | Unverified | 0 |
| HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity | May 19, 2025 | Benchmarking, feature selection | Code Available | 0 |
| CompBench: Benchmarking Complex Instruction-guided Image Editing | May 18, 2025 | Benchmarking, Instruction Following | Unverified | 0 |
| OSS-Bench: Benchmark Generator for Coding LLMs | May 18, 2025 | Benchmarking | Code Available | 0 |
| ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models | May 18, 2025 | Articles, Benchmarking | Unverified | 0 |
| Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | May 18, 2025 | Benchmarking, Scene Understanding | Unverified | 0 |
| Disambiguation in Conversational Question Answering in the Era of LLM: A Survey | May 18, 2025 | Benchmarking, Conversational Question Answering | Unverified | 0 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | May 17, 2025 | Benchmarking | Code Available | 0 |
| Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025) | May 17, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | May 17, 2025 | Benchmarking, Binary Classification | Code Available | 0 |
| GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation | May 17, 2025 | Benchmarking | Unverified | 0 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics | Code Available | 0 |
| MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems | May 16, 2025 | Benchmarking, Mixture-of-Experts | Unverified | 0 |
| GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents | May 16, 2025 | Benchmarking, Instruction Following | Unverified | 0 |
| Benchmarking CFAR and CNN-based Peak Detection Algorithms in ISAC under Hardware Impairments | May 16, 2025 | Benchmarking, Integrated sensing and communication | Unverified | 0 |
| Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models | May 16, 2025 | Benchmarking | Unverified | 0 |
| Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese | May 16, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks | May 16, 2025 | Benchmarking, Link Prediction | Code Available | 0 |
| Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale | May 16, 2025 | Benchmarking, TAG | Unverified | 0 |
| STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking | May 16, 2025 | Benchmarking | Code Available | 0 |
| CleanPatrick: A Benchmark for Image Data Cleaning | May 16, 2025 | Benchmarking, Label Error Detection | Code Available | 0 |
| Visual Anomaly Detection under Complex View-Illumination Interplay: A Large-Scale Benchmark | May 16, 2025 | Anomaly Detection, Benchmarking | Unverified | 0 |
| Relation Extraction Across Entire Books to Reconstruct Community Networks: The AffilKG Datasets | May 16, 2025 | Benchmarking, Knowledge Graphs | Unverified | 0 |
| Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities | May 16, 2025 | Benchmarking | Unverified | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | May 16, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges | May 16, 2025 | Benchmarking, State Estimation | Code Available | 0 |
| TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs | May 16, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems | May 16, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding | May 15, 2025 | Benchmarking, Semantic Communication | Unverified | 0 |
| PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language | May 15, 2025 | Benchmarking, Optical Character Recognition | Code Available | 0 |
| JointDistill: Adaptive Multi-Task Distillation for Joint Depth Estimation and Scene Segmentation | May 15, 2025 | Benchmarking, Depth Estimation | Unverified | 0 |
| What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs | May 15, 2025 | All, Benchmarking | Unverified | 0 |
| Real-World fNIRS-Based Brain-Computer Interfaces: Benchmarking Deep Learning and Classical Models in Interactive Gaming | May 15, 2025 | Benchmarking, Data Augmentation | Unverified | 0 |
| DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs | May 15, 2025 | Benchmarking, Fairness | Unverified | 0 |
| Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization | May 15, 2025 | Benchmarking, Clustering | Unverified | 0 |
| GNN-Suite: a Graph Neural Network Benchmarking Framework for Biomedical Informatics | May 15, 2025 | Benchmarking, Graph Neural Network | Code Available | 0 |
| On the Evaluation of Engineering Artificial General Intelligence | May 15, 2025 | Benchmarking | Unverified | 0 |
| Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M | May 15, 2025 | Benchmarking, Memorization | Code Available | 0 |
| WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models | May 14, 2025 | Benchmarking | Unverified | 0 |
| VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts | May 14, 2025 | Benchmarking, Form | Unverified | 0 |
| RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo | May 14, 2025 | Benchmarking, Optical Flow Estimation | Unverified | 0 |
| KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning | May 14, 2025 | Benchmarking, MMLU | Unverified | 0 |
| BioVFM-21M: Benchmarking and Scaling Self-Supervised Vision Foundation Models for Biomedical Image Analysis | May 14, 2025 | Benchmarking, Computational Efficiency | Code Available | 0 |
| ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | May 14, 2025 | Benchmarking, Deformable Object Manipulation | Unverified | 0 |
| TARGET: Benchmarking Table Retrieval for Generative Tasks | May 14, 2025 | Benchmarking, Representation Learning | Unverified | 0 |
| A Standardized Benchmark Set of Clustering Problem Instances for Comparing Black-Box Optimizers | May 14, 2025 | Benchmarking, Clustering | Unverified | 0 |
| How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference | May 14, 2025 | Benchmarking | Unverified | 0 |
| Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | May 13, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| ExEBench: Benchmarking Foundation Models on Extreme Earth Events | May 13, 2025 | Benchmarking, Management | Code Available | 0 |