| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | Feb 3, 2025 | Benchmarking, Fairness | Unverified | 0 |
| SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering | Feb 3, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | Feb 2, 2025 | Benchmarking | Code Available | 1 |
| Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks | Feb 2, 2025 | Benchmarking | Code Available | 0 |
| True Online TD-Replan(lambda) Achieving Planning through Replaying | Jan 31, 2025 | Benchmarking | Unverified | 0 |
| Evolving Hard Maximum Cut Instances for Quantum Approximate Optimization Algorithms | Jan 30, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | Jan 30, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Unraveling the Capabilities of Language Models in News Summarization | Jan 30, 2025 | Benchmarking, Few-Shot Learning | Code Available | 0 |
| The iToBoS dataset: skin region images extracted from 3D total body photographs for lesion detection | Jan 30, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research | Jan 29, 2025 | Benchmarking | Unverified | 0 |
| SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model | Jan 28, 2025 | Benchmarking, Language Modeling | Code Available | 2 |
| HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Jan 28, 2025 | Adversarial Attack, Benchmarking | Code Available | 1 |
| Molecular-driven Foundation Model for Oncologic Pathology | Jan 28, 2025 | Benchmarking, Diagnostic | Code Available | 4 |
| Benchmarking Quantum Convolutional Neural Networks for Signal Classification in Simulated Gamma-Ray Burst Detection | Jan 28, 2025 | Benchmarking | Unverified | 0 |
| Making Sense of Data in the Wild: Data Analysis Automation at Scale | Jan 27, 2025 | Benchmarking, Diversity | Unverified | 0 |
| PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding | Jan 27, 2025 | Benchmarking, Common Sense Reasoning | Unverified | 0 |
| A Benchmarking Environment for Worker Flexibility in Flexible Job Shop Scheduling Problems | Jan 27, 2025 | Benchmarking, Evolutionary Algorithms | Unverified | 0 |
| Transfer of Knowledge through Reverse Annealing: A Preliminary Analysis of the Benefits and What to Share | Jan 27, 2025 | Benchmarking, Transfer Learning | Unverified | 0 |
| IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding | Jan 27, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation | Jan 27, 2025 | Benchmarking, C++ code | Unverified | 0 |
| Benchmarking Quantum Reinforcement Learning | Jan 27, 2025 | Benchmarking, reinforcement-learning | Code Available | 0 |
| GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Jan 26, 2025 | Benchmarking, Diversity | Code Available | 0 |
| CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry | Jan 26, 2025 | Benchmarking, Object Detection | Unverified | 0 |
| Self-supervised Benchmark Lottery on ImageNet: Do Marginal Improvements Translate to Improvements on Similar Datasets? | Jan 26, 2025 | Benchmarking, Self-Supervised Learning | Unverified | 0 |
| Beyond Benchmarks: On The False Promise of AI Regulation | Jan 26, 2025 | Benchmarking | Unverified | 0 |
| EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning | Jan 25, 2025 | Benchmarking, Evolutionary Algorithms | Code Available | 7 |
| Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study | Jan 25, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking global optimization techniques for unmanned aerial vehicle path planning | Jan 24, 2025 | Benchmarking, global-optimization | Unverified | 0 |
| MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents | Jan 24, 2025 | Benchmarking | Code Available | 3 |
| Feature-based Evolutionary Diversity Optimization of Discriminating Instances for Chance-constrained Optimization Problems | Jan 24, 2025 | Benchmarking, Diversity | Unverified | 0 |
| The Karp Dataset | Jan 24, 2025 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video | Jan 24, 2025 | 3D Reconstruction, Benchmarking | Code Available | 2 |
| Enhancing Biomedical Relation Extraction with Directionality | Jan 23, 2025 | Benchmarking, Document-level Relation Extraction | Code Available | 1 |
| AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning | Jan 23, 2025 | Benchmarking, image-classification | Unverified | 0 |
| DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale | Jan 23, 2025 | Benchmarking | Unverified | 0 |
| You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | Jan 23, 2025 | Benchmarking, Domain Adaptation | Unverified | 0 |
| RAG-Reward: Optimizing RAG with Reward Modeling and RLHF | Jan 22, 2025 | Benchmarking, Hallucination | Unverified | 0 |
| Leveraging LLMs to Create a Haptic Devices' Recommendation System | Jan 22, 2025 | Benchmarking | Unverified | 0 |
| Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities | Jan 22, 2025 | Benchmarking, Referring Expression | Unverified | 0 |
| Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Jan 22, 2025 | Benchmarking | Code Available | 0 |
| CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization | Jan 22, 2025 | Benchmarking, regression | Unverified | 0 |
| Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) | Jan 21, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes | Jan 21, 2025 | Benchmarking | Unverified | 0 |
| Optimally-Weighted Maximum Mean Discrepancy Framework for Continual Learning | Jan 21, 2025 | Benchmarking, Continual Learning | Unverified | 0 |
| Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems | Jan 21, 2025 | Autonomous Vehicles, Benchmarking | Code Available | 0 |
| Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing | Jan 20, 2025 | Benchmarking, Evolutionary Algorithms | Unverified | 0 |
| Algorithm Selection with Probing Trajectories: Benchmarking the Choice of Classifier Model | Jan 20, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Large Language Models via Random Variables | Jan 20, 2025 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models | Jan 19, 2025 | Benchmarking, Question Answering | Code Available | 1 |