| An Image Dataset for Benchmarking Recommender Systems with Raw Pixels | Sep 13, 2023 | Benchmarking, Recommendation Systems | Code Available | 1 |
| LEAF: A Benchmark for Federated Settings | Dec 3, 2018 | Autonomous Vehicles, Benchmarking | Code Available | 1 |
| CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | May 6, 2025 | Benchmarking | Code Available | 1 |
| animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics | Jun 3, 2024 | Audio Classification, Benchmarking | Code Available | 1 |
| AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Dec 15, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets | May 7, 2024 | Benchmarking, Cancer Classification | Code Available | 1 |
| An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Mar 15, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| Benchmarking Counterfactual Image Generation | Mar 29, 2024 | Benchmarking, Conditional Image Generation | Code Available | 1 |
| AdsorbML: A Leap in Efficiency for Adsorption Energy Calculations using Generalizable Machine Learning Potentials | Nov 29, 2022 | Benchmarking | Code Available | 1 |
| Less Is More: A Comparison of Active Learning Strategies for 3D Medical Image Segmentation | Jul 2, 2022 | Active Learning, Benchmarking | Code Available | 1 |
| Combinatorial Optimization with Policy Adaptation using Latent Space Search | Nov 13, 2023 | Benchmarking, Combinatorial Optimization | Code Available | 1 |
| Benchmarking Data-driven Surrogate Simulators for Artificial Electromagnetic Materials | Nov 6, 2021 | Benchmarking, Neural Network simulation | Code Available | 1 |
| A Survey of Pathology Foundation Model: Progress and Future Directions | Apr 5, 2025 | Benchmarking, Multiple Instance Learning | Code Available | 1 |
| A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Mar 27, 2025 | Benchmarking, Deep Learning | Code Available | 1 |
| GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Apr 30, 2025 | 3D Molecule Generation, Benchmarking | Code Available | 1 |
| Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Jul 3, 2024 | Benchmarking, Object | Code Available | 1 |
| Benchmarking Data Science Agents | Feb 27, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| Light Field Salient Object Detection: A Review and Benchmark | Oct 10, 2020 | Benchmarking, Object | Code Available | 1 |
| CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Sep 16, 2020 | Benchmarking, Knowledge Graph Completion | Code Available | 1 |
| LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Oct 28, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care | Sep 16, 2022 | Benchmarking, Deep Learning | Code Available | 1 |
| MC-Blur: A Comprehensive Benchmark for Image Deblurring | Dec 1, 2021 | Benchmarking, Deblurring | Code Available | 1 |
| AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Nov 26, 2024 | Benchmarking, Text-to-Video Generation | Code Available | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Jul 8, 2024 | Benchmarking, knowledge editing | Code Available | 1 |
| Benchmarking Deep Graph Generative Models for Optimizing New Drug Molecules for COVID-19 | Feb 9, 2021 | Benchmarking, Q-Learning | Code Available | 1 |
| Benchmarking deep inverse models over time, and the neural-adjoint method | Sep 27, 2020 | Benchmarking | Code Available | 1 |
| A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification | Nov 28, 2022 | Benchmarking, image-classification | Code Available | 1 |
| Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit | Sep 7, 2022 | Benchmarking | Code Available | 1 |
| Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Feb 27, 2025 | Benchmarking | Code Available | 1 |
| LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Oct 13, 2024 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| CODEMENV: Benchmarking Large Language Models on Code Migration | Jun 1, 2025 | Benchmarking | Code Available | 1 |
| CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Feb 26, 2025 | Benchmarking, Code Generation | Code Available | 1 |
| CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Jan 22, 2020 | Benchmarking, object-detection | Code Available | 1 |
| Benchmarking Deep Learning Interpretability in Time Series Predictions | Oct 26, 2020 | Benchmarking, Deep Learning | Code Available | 1 |
| Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Jul 9, 2021 | Benchmarking, Document Classification | Code Available | 1 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | Benchmarking, Math | Code Available | 1 |
| CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Mar 25, 2024 | Benchmarking | Code Available | 1 |
| Benchmarking Deep Models for Salient Object Detection | Feb 7, 2022 | Benchmarking, Object | Code Available | 1 |
| Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Mar 24, 2025 | Benchmarking, Semantic Segmentation | Code Available | 1 |
| New Protocols and Negative Results for Textual Entailment Data Collection | Apr 24, 2020 | Benchmarking, Diversity | Code Available | 1 |
| Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Oct 18, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |
| Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks | Nov 25, 2024 | Benchmarking, object-detection | Code Available | 1 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Apr 6, 2025 | Benchmarking, Combinatorial Optimization | Code Available | 1 |
| MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | Nov 14, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Coarse-to-Fine Q-attention with Learned Path Ranking | Apr 4, 2022 | Benchmarking | Code Available | 1 |
| High-Dimensional Inference in Bayesian Networks | Dec 16, 2021 | Benchmarking, Vocal Bursts Intensity Prediction | Code Available | 1 |
| COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Mar 15, 2019 | Benchmarking | Code Available | 1 |
| CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Nov 10, 2023 | Benchmarking, Cloud Computing | Code Available | 1 |
| Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Aug 2, 2024 | Adversarial Attack, Adversarial Purification | Code Available | 1 |
| Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Oct 12, 2021 | Benchmarking | Code Available | 1 |