| CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Oct 23, 2023 | Benchmarking | Code Available | 1 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Feb 22, 2024 | Benchmarking | Code Available | 1 |
| CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Nov 19, 2022 | Benchmarking, C++ code | Code Available | 1 |
| Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient | Jul 3, 2020 | Benchmarking, MuJoCo | Code Available | 1 |
| MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts | Feb 14, 2022 | Benchmarking | Code Available | 1 |
| Benchmarking the Robustness of Deep Neural Networks to Common Corruptions in Digital Pathology | Jun 30, 2022 | Benchmarking, Diagnostic | Code Available | 1 |
| DACBench: A Benchmark Library for Dynamic Algorithm Configuration | May 18, 2021 | Benchmarking | Code Available | 1 |
| Benchmarking Image Retrieval for Visual Localization | Nov 24, 2020 | Autonomous Driving, Benchmarking | Code Available | 1 |
| Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection | May 30, 2022 | 3D Object Detection, Autonomous Driving | Code Available | 1 |
| ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Mar 26, 2024 | Benchmarking, Machine Reading Comprehension | Code Available | 1 |
| MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs | Jan 1, 2023 | Benchmarking, GPU | Code Available | 1 |
| COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Mar 19, 2023 | Benchmarking, Event Extraction | Code Available | 1 |
| MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | May 26, 2025 | Benchmarking, Minecraft | Code Available | 1 |
| minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models | Mar 24, 2022 | Benchmarking, Sentence | Code Available | 1 |
| Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Oct 13, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 |
| Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Dec 26, 2019 | Benchmarking, Domain Adaptation | Code Available | 1 |
| Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Dec 10, 2021 | Benchmarking | Code Available | 1 |
| Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans | Jan 14, 2021 | Benchmarking, Medical Diagnosis | Code Available | 1 |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | Aug 25, 2023 | Benchmarking | Code Available | 1 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Jun 26, 2025 | Benchmarking, Drug Design | Code Available | 1 |
| CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling | Oct 14, 2022 | Benchmarking, Language Modeling | Code Available | 1 |
| ByzFL: Research Framework for Robust Federated Learning | May 30, 2025 | Benchmarking, Federated Learning | Code Available | 1 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | May 30, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Feb 4, 2023 | Adversarial Attack, Adversarial Robustness | Code Available | 1 |
| scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Jun 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 |