| Explainable Benchmarking for Iterative Optimization Heuristics | Jan 31, 2024 | Benchmarking, Evolutionary Algorithms | Code Available | 1 | 5 |
| Explainable Global Wildfire Prediction Models using Graph Neural Networks | Feb 11, 2024 | Benchmarking, Community Detection | Code Available | 1 | 5 |
| Learning Representations with Contrastive Self-Supervised Learning for Histopathology Applications | Dec 10, 2021 | Benchmarking, Contrastive Learning | Code Available | 1 | 5 |
| BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Apr 2, 2025 | 3D Reconstruction, Benchmarking | Code Available | 1 | 5 |
| scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Jun 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 | 5 |
| Bag of Tricks for Adversarial Training | Oct 1, 2020 | Adversarial Robustness, Benchmarking | Code Available | 1 | 5 |
| Biomedical Data-to-Text Generation via Fine-Tuning Transformers | Sep 3, 2021 | Benchmarking, Data-to-Text Generation | Code Available | 1 | 5 |
| Exploring Large Language Models for Classical Philology | May 23, 2023 | Benchmarking, Decoder | Code Available | 1 | 5 |
| BioRED: A Rich Biomedical Relation Extraction Dataset | Apr 8, 2022 | Benchmarking, Binary Relation Extraction | Code Available | 1 | 5 |
| BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning | Feb 23, 2025 | Benchmarking | Code Available | 1 | 5 |
| Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data | Jul 3, 2025 | Benchmarking, Representation Learning | Code Available | 1 | 5 |
| S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations | Oct 12, 2021 | Benchmarking, Voice Conversion | Code Available | 1 | 5 |
| LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation | Nov 4, 2024 | Benchmarking, Graph Generation | Code Available | 1 | 5 |
| Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts | Nov 7, 2023 | Benchmarking, Machine Translation | Code Available | 1 | 5 |
| AQuA: A Benchmarking Tool for Label Quality Assessment | Jun 15, 2023 | Benchmarking, Label Error Detection | Code Available | 1 | 5 |
| Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed | May 27, 2022 | Benchmarking, Binary Classification | Code Available | 1 | 5 |
| Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models | Dec 15, 2023 | Benchmarking, Code Summarization | Code Available | 1 | 5 |
| Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking | May 28, 2025 | Benchmarking | Code Available | 1 | 5 |
| ScandEval: A Benchmark for Scandinavian Natural Language Processing | Apr 3, 2023 | Benchmarking, Cross-Lingual Transfer | Code Available | 1 | 5 |
| APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Dec 25, 2023 | Animal Pose Estimation, Benchmarking | Code Available | 1 | 5 |
| Benchmarking large language models for biomedical natural language processing applications and recommendations | May 10, 2023 | Benchmarking, Document Classification | Code Available | 1 | 5 |
| Quantum machine learning of large datasets using randomized measurements | Aug 2, 2021 | Benchmarking, BIG-bench Machine Learning | Code Available | 1 | 5 |
| MatTools: Benchmarking Large Language Models for Materials Science Tools | May 16, 2025 | Benchmarking, Question Answering | Code Available | 1 | 5 |
| FineSurE: Fine-grained Summarization Evaluation using LLMs | Jul 1, 2024 | Benchmarking, Hallucination | Code Available | 1 | 5 |
| LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies | Jul 22, 2024 | Benchmarking, Out-of-Distribution Generalization | Code Available | 1 | 5 |
| Fast hyperboloid decision tree algorithms | Oct 20, 2023 | Benchmarking, Riemannian optimization | Code Available | 1 | 5 |
| BiCo-Net: Regress Globally, Match Locally for Robust 6D Pose Estimation | May 7, 2022 | 6D Pose Estimation, Benchmarking | Code Available | 1 | 5 |
| Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Jul 16, 2024 | Benchmarking, Code Generation | Code Available | 1 | 5 |
| BiBench: Benchmarking and Analyzing Network Binarization | Jan 26, 2023 | Benchmarking, Binarization | Code Available | 1 | 5 |
| FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models | Jan 1, 2024 | Benchmarking | Code Available | 1 | 5 |
| ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots | Sep 16, 2022 | Benchmarking, Question Answering | Code Available | 1 | 5 |
| ScrewNet: Category-Independent Articulation Model Estimation From Depth Images Using Screw Theory | Aug 24, 2020 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Graph Neural Networks on Dynamic Link Prediction | Sep 29, 2021 | Benchmarking, Dynamic Link Prediction | Code Available | 1 | 5 |
| Benchmarking Graph Neural Networks for FMRI analysis | Nov 16, 2022 | Benchmarking | Code Available | 1 | 5 |
| Beyond neural scaling laws: beating power law scaling via data pruning | Jun 29, 2022 | Benchmarking | Code Available | 1 | 5 |
| Beyond Normal: On the Evaluation of Mutual Information Estimators | Jun 19, 2023 | Benchmarking, Domain Generalization | Code Available | 1 | 5 |
| Formalizing Multimedia Recommendation through Multimodal Deep Learning | Sep 11, 2023 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| LagrangeBench: A Lagrangian Fluid Mechanics Benchmarking Suite | Sep 28, 2023 | Benchmarking | Code Available | 1 | 5 |
| Large Language Models for Multi-Robot Systems: A Survey | Feb 6, 2025 | Action Generation, Benchmarking | Code Available | 1 | 5 |
| LEAF: A Benchmark for Federated Settings | Dec 3, 2018 | Autonomous Vehicles, Benchmarking | Code Available | 1 | 5 |
| LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models | Nov 1, 2024 | Benchmarking, Mixture-of-Experts | Code Available | 1 | 5 |
| Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks | Nov 25, 2024 | Benchmarking, object-detection | Code Available | 1 | 5 |
| MIRFLEX: Music Information Retrieval Feature Library for Extraction | Nov 1, 2024 | Benchmarking, Information Retrieval | Code Available | 1 | 5 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Oct 1, 2023 | Benchmarking, Math | Code Available | 1 | 5 |
| Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking | Jun 9, 2024 | Benchmarking, Drug Discovery | Code Available | 1 | 5 |
| FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Jun 6, 2025 | Benchmarking | Code Available | 1 | 5 |
| FiFAR: A Fraud Detection Dataset for Learning to Defer | Dec 20, 2023 | Benchmarking, Decision Making | Code Available | 1 | 5 |
| ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Jun 1, 2025 | Benchmarking, Management | Code Available | 0 | 5 |
| Conformal Prediction: A Theoretical Note and Benchmarking Transductive Node Classification in Graphs | Sep 26, 2024 | Benchmarking, Conformal Prediction | Code Available | 0 | 5 |
| Knowledge-Driven Slot Constraints for Goal-Oriented Dialogue Systems | Jun 1, 2021 | Benchmarking, Goal-Oriented Dialogue Systems | Code Available | 0 | 5 |