| TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Nov 16, 2023 | BenchmarkingEvent Extraction | CodeCode Available | 1 |
| CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Feb 4, 2023 | Adversarial AttackAdversarial Robustness | CodeCode Available | 1 |
| Benchmarking Language Model Creativity: A Case Study on Code Generation | Jul 12, 2024 | BenchmarkingCode Generation | CodeCode Available | 1 |
| Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Nov 22, 2021 | Benchmarking | CodeCode Available | 1 |
| 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs | Apr 28, 2024 | Benchmarking | CodeCode Available | 1 |
| GuacaMol: Benchmarking Models for De Novo Molecular Design | Nov 22, 2018 | BenchmarkingDrug Discovery | CodeCode Available | 1 |
| CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Jun 10, 2025 | Benchmarking | CodeCode Available | 1 |
| Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Nov 30, 2023 | BenchmarkingOpenAI Gym | CodeCode Available | 1 |
| ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Jun 21, 2025 | BenchmarkingCPU | CodeCode Available | 1 |
| Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate | Aug 12, 2021 | Benchmarking | CodeCode Available | 1 |
| A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation, and Research Challenges | Oct 21, 2022 | BenchmarkingCommunity Detection | CodeCode Available | 1 |
| HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing | Sep 25, 2024 | BenchmarkingImage Dehazing | CodeCode Available | 1 |
| Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Feb 18, 2024 | BenchmarkingLanguage Modeling | CodeCode Available | 1 |
| HINT3: Raising the bar for Intent Detection in the Wild | Sep 29, 2020 | BenchmarkingIntent Detection | CodeCode Available | 1 |
| Contemporary Symbolic Regression Methods and their Relative Performance | Jul 29, 2021 | Benchmarkingparameter estimation | CodeCode Available | 1 |
| Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images | Mar 15, 2024 | BenchmarkingKnowledge Distillation | CodeCode Available | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | BenchmarkingEarth Observation | CodeCode Available | 1 |
| Benchmarking Quantized Neural Networks on FPGAs with FINN | Feb 2, 2021 | BenchmarkingQuantization | CodeCode Available | 1 |
| Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Dec 26, 2019 | BenchmarkingDomain Adaptation | CodeCode Available | 1 |
| How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Aug 29, 2024 | BenchmarkingGeneral Knowledge | CodeCode Available | 1 |
| Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Jun 14, 2020 | BenchmarkingDeep Reinforcement Learning | CodeCode Available | 1 |
| ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Jun 15, 2025 | Benchmarking | CodeCode Available | 1 |
| CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Jun 5, 2024 | Benchmarkingenergy management | CodeCode Available | 1 |
| HUMAN4D: A Human-Centric Multimodal Dataset for Motions and Immersive Media | Oct 14, 2021 | 3D Pose EstimationBenchmarking | CodeCode Available | 1 |
| 4D Panoptic LiDAR Segmentation | Feb 24, 2021 | 4D Panoptic SegmentationBenchmarking | CodeCode Available | 1 |
| A framework for benchmarking clustering algorithms | Sep 20, 2022 | BenchmarkingClustering | CodeCode Available | 1 |
| CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Jun 18, 2023 | BenchmarkingRetrieval | CodeCode Available | 1 |
| Hyperparameter optimization in deep multi-target prediction | Nov 8, 2022 | AutoMLBenchmarking | CodeCode Available | 1 |
| Comprehensive benchmarking of large language models for RNA secondary structure prediction | Oct 21, 2024 | Benchmarking | CodeCode Available | 1 |
| Combinatorial Optimization with Policy Adaptation using Latent Space Search | Nov 13, 2023 | BenchmarkingCombinatorial Optimization | CodeCode Available | 1 |
| Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining | Nov 22, 2017 | Benchmarkingfeature selection | CodeCode Available | 1 |
| Image Colorization: A Survey and Dataset | Aug 25, 2020 | BenchmarkingColorization | CodeCode Available | 1 |
| Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Nov 11, 2024 | BenchmarkingImage Segmentation | CodeCode Available | 1 |
| A SWAT-based Reinforcement Learning Framework for Crop Management | Feb 10, 2023 | BenchmarkingDecision Making | CodeCode Available | 1 |
| AirSim Drone Racing Lab | Mar 12, 2020 | BenchmarkingOptical Flow Estimation | CodeCode Available | 1 |
| Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Jul 3, 2024 | BenchmarkingObject | CodeCode Available | 1 |
| Benchmarking Simulation-Based Inference | Jan 12, 2021 | Benchmarking | CodeCode Available | 1 |
| A Comprehensive Overview of Large Language Models | Jul 12, 2023 | Benchmarking | CodeCode Available | 1 |
| A framework for benchmarking class-out-of-distribution detection and its application to ImageNet | Feb 23, 2023 | BenchmarkingKnowledge Distillation | CodeCode Available | 1 |
| CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | May 6, 2025 | Benchmarking | CodeCode Available | 1 |
| Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Apr 25, 2024 | Benchmarkingobject-detection | CodeCode Available | 1 |
| Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model | Apr 10, 2024 | BenchmarkingImage-to-Image Translation | CodeCode Available | 1 |
| A Systematic Benchmarking Analysis of Transfer Learning for Medical Image Analysis | Aug 12, 2021 | BenchmarkingMedical Image Analysis | CodeCode Available | 1 |
| Improving and Benchmarking Offline Reinforcement Learning Algorithms | Jun 1, 2023 | AttributeBenchmarking | CodeCode Available | 1 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Jun 26, 2025 | BenchmarkingDrug Design | CodeCode Available | 1 |
| DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Oct 30, 2024 | BenchmarkingManagement | CodeCode Available | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Jul 8, 2024 | Benchmarkingknowledge editing | CodeCode Available | 1 |
| Geometric Deep Learning for Structure-Based Drug Design: A Survey | Jun 20, 2023 | BenchmarkingDeep Learning | CodeCode Available | 1 |
| CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Mar 25, 2024 | Benchmarking | CodeCode Available | 1 |
| CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Sep 16, 2020 | BenchmarkingKnowledge Graph Completion | CodeCode Available | 1 |