| TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction | May 20, 2025 | Benchmarking, Knowledge Graphs | Code Available | 1 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | May 20, 2025 | Benchmarking | Unverified | 0 |
| Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach | May 20, 2025 | Benchmarking | Unverified | 0 |
| LLM-based Evaluation Policy Extraction for Ecological Modeling | May 20, 2025 | Benchmarking, Large Language Model | Unverified | 0 |
| SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis | May 20, 2025 | Benchmarking, Model Optimization | Code Available | 0 |
| Benchmarking the Myopic Trap: Positional Bias in Information Retrieval | May 20, 2025 | Benchmarking, Information Retrieval | Code Available | 5 |
| A Data-Driven Method to Identify IBRs with Dominant Participation in Sub-Synchronous Oscillations | May 20, 2025 | Benchmarking | Unverified | 0 |
| TransBench: Benchmarking Machine Translation for Industrial-Scale Applications | May 20, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | May 20, 2025 | Anomaly Localization, Benchmarking | Unverified | 0 |
| SlangDIT: Benchmarking LLMs in Interpretative Slang Translation | May 20, 2025 | Benchmarking, Sentence | Unverified | 0 |
| ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | May 20, 2025 | Benchmarking | Unverified | 0 |
| OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking | May 20, 2025 | Benchmarking | Code Available | 3 |
| Benchmarking data encoding methods in Quantum Machine Learning | May 20, 2025 | Benchmarking, Quantum Machine Learning | Unverified | 0 |
| DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | May 20, 2025 | Benchmarking, Diagnostic | Code Available | 1 |
| SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas | May 20, 2025 | Benchmarking, Logical Reasoning | Unverified | 0 |
| NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation | May 20, 2025 | Autonomous Navigation, Benchmarking | Unverified | 0 |
| SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference | May 19, 2025 | Benchmarking, EEG | Unverified | 0 |
| HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity | May 19, 2025 | Benchmarking, feature selection | Code Available | 0 |
| Benchmarking MOEAs for solving continuous multi-objective RL problems | May 19, 2025 | Benchmarking, Evolutionary Algorithms | Code Available | 0 |
| Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference | May 19, 2025 | Benchmarking, Causal Inference | Unverified | 0 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | May 19, 2025 | Benchmarking, Legal Reasoning | Unverified | 0 |
| CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | May 19, 2025 | Benchmarking, Red Teaming | Unverified | 0 |
| PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI | May 19, 2025 | Benchmarking, Minecraft | Unverified | 0 |
| A Comprehensive Benchmarking Platform for Deep Generative Models in Molecular Design | May 19, 2025 | Benchmarking, Drug Discovery | Unverified | 0 |
| Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | May 19, 2025 | Benchmarking, Chatbot | Code Available | 1 |
| Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning | May 19, 2025 | Benchmarking | Unverified | 0 |
| Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | May 19, 2025 | Automated Theorem Proving, Benchmarking | Code Available | 1 |
| Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | May 19, 2025 | Benchmarking | Code Available | 0 |
| TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents | May 19, 2025 | AI Agent, Benchmarking | Code Available | 1 |
| Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings | May 19, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization | May 18, 2025 | Benchmarking | Code Available | 1 |
| OSS-Bench: Benchmark Generator for Coding LLMs | May 18, 2025 | Benchmarking | Code Available | 0 |
| Disambiguation in Conversational Question Answering in the Era of LLM: A Survey | May 18, 2025 | Benchmarking, Conversational Question Answering | Unverified | 0 |
| ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models | May 18, 2025 | Articles, Benchmarking | Unverified | 0 |
| CompBench: Benchmarking Complex Instruction-guided Image Editing | May 18, 2025 | Benchmarking, Instruction Following | Unverified | 0 |
| MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | May 18, 2025 | Benchmarking, Medical Visual Question Answering | Code Available | 1 |
| Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | May 18, 2025 | Benchmarking, Scene Understanding | Unverified | 0 |
| GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | May 18, 2025 | Benchmarking | Code Available | 2 |
| Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025) | May 17, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | May 17, 2025 | Benchmarking | Code Available | 0 |
| LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | May 17, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation | May 17, 2025 | Benchmarking | Unverified | 0 |
| SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | May 17, 2025 | Benchmarking, Binary Classification | Code Available | 0 |
| Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges | May 16, 2025 | Benchmarking, State Estimation | Code Available | 0 |
| Benchmarking CFAR and CNN-based Peak Detection Algorithms in ISAC under Hardware Impairments | May 16, 2025 | Benchmarking, Integrated sensing and communication | Unverified | 0 |
| Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale | May 16, 2025 | Benchmarking, TAG | Unverified | 0 |
| ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems | May 16, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | May 16, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics | Code Available | 0 |
| MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems | May 16, 2025 | Benchmarking, Mixture-of-Experts | Unverified | 0 |