| CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Mar 12, 2025 | Benchmarking, Code Classification | Code Available | 1 |
| Mukayese: Turkish NLP Strikes Back | Mar 2, 2022 | Benchmarking, Language Modeling | Code Available | 1 |
| Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Apr 16, 2021 | Benchmarking, Causal Discovery | Code Available | 1 |
| Decoding the Enigma: Benchmarking Humans and AIs on the Many Facets of Working Memory | Jul 20, 2023 | Benchmarking, Decision Making | Code Available | 1 |
| Benchmarking Image Retrieval for Visual Localization | Nov 24, 2020 | Autonomous Driving, Benchmarking | Code Available | 1 |
| AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Oct 31, 2024 | Benchmarking, Cloud Removal | Code Available | 1 |
| ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Mar 26, 2024 | Benchmarking, Machine Reading Comprehension | Code Available | 1 |
| Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Dec 10, 2021 | Benchmarking | Code Available | 1 |
| Curious Hierarchical Actor-Critic Reinforcement Learning | May 7, 2020 | Benchmarking, Hierarchical Reinforcement Learning | Code Available | 1 |
| CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Jan 2, 2025 | Benchmarking, Computer Security | Code Available | 1 |
| CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Nov 19, 2022 | Benchmarking, C++ code | Code Available | 1 |
| Multimodal LLMs Can Reason about Aesthetics in Zero-Shot | Jan 15, 2025 | Benchmarking, Hallucination | Code Available | 1 |
| MultiRes-NetVLAD: Augmenting Place Recognition Training with Low-Resolution Imagery | Feb 18, 2022 | Benchmarking, Representation Learning | Code Available | 1 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | May 30, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Dec 2, 2021 | Benchmarking, Ordinal Classification | Code Available | 1 |
| MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Oct 20, 2023 | Benchmarking, de-en | Code Available | 1 |
| Mutual-Information Based Few-Shot Classification | Jun 23, 2021 | Benchmarking, Classification | Code Available | 1 |
| NAS-Bench-101: Towards Reproducible Neural Architecture Search | Feb 25, 2019 | Benchmarking, Neural Architecture Search | Code Available | 1 |
| BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Nov 21, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| NAS-Bench-Graph: Benchmarking Graph Neural Architecture Search | Jun 18, 2022 | Benchmarking, Graph Neural Network | Code Available | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| NATS-Bench: Benchmarking NAS Algorithms for Architecture Topology and Size | Aug 28, 2020 | Benchmarking, Diagnostic | Code Available | 1 |
| Autonomous Reinforcement Learning: Formalism and Benchmarking | Dec 17, 2021 | Benchmarking, reinforcement-learning | Code Available | 1 |
| CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Oct 3, 2023 | Benchmarking, Causal Discovery | Code Available | 1 |
| scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Jun 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 |
| CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Oct 23, 2023 | Benchmarking | Code Available | 1 |
| Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | May 8, 2025 | Benchmarking, Prompt Engineering | Code Available | 1 |
| Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Nov 4, 2024 | Action Generation, Benchmarking | Code Available | 1 |
| Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Jun 13, 2024 | Benchmarking, LLM-generated Text Detection | Code Available | 1 |
| D2S: Document-to-Slide Generation Via Query-Based Text Summarization | May 8, 2021 | Benchmarking, Long Form Question Answering | Code Available | 1 |
| Decoding the Underlying Meaning of Multimodal Hateful Memes | May 28, 2023 | Benchmarking, Hateful Meme Classification | Code Available | 1 |
| A Critical Assessment of State-of-the-Art in Entity Alignment | Oct 30, 2020 | Benchmarking, Entity Alignment | Code Available | 1 |
| Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Nov 5, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| NeuroEvoBench: Benchmarking Evolutionary Optimizers for Deep Learning Applications | Nov 4, 2023 | Benchmarking, Deep Learning | Code Available | 1 |
| COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Mar 19, 2023 | Benchmarking, Event Extraction | Code Available | 1 |
| NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Oct 2, 2023 | Benchmarking, News Recommendation | Code Available | 1 |
| NLPBench: Evaluating Large Language Models on Solving NLP Problems | Sep 27, 2023 | Benchmarking, Math | Code Available | 1 |
| Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Dec 26, 2019 | Benchmarking, Domain Adaptation | Code Available | 1 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Jun 26, 2025 | Benchmarking, Drug Design | Code Available | 1 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Feb 22, 2024 | Benchmarking | Code Available | 1 |
| NTIRE 2020 Challenge on Real-World Image Super-Resolution: Methods and Results | May 5, 2020 | Benchmarking, Image Super-Resolution | Code Available | 1 |
| NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation | Feb 18, 2021 | Benchmarking, Interpretable Machine Learning | Code Available | 1 |
| AQuA: A Benchmarking Tool for Label Quality Assessment | Jun 15, 2023 | Benchmarking, Label Error Detection | Code Available | 1 |
| Object Shape Error Response Using Bayesian 3-D Convolutional Neural Networks for Assembly Systems With Compliant Parts | Dec 8, 2021 | 3D Shape Modeling, Benchmarking | Code Available | 1 |
| CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Feb 4, 2023 | Adversarial Attack, Adversarial Robustness | Code Available | 1 |
| Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms | Jul 8, 2021 | Benchmarking | Code Available | 1 |
| APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Dec 25, 2023 | Animal Pose Estimation, Benchmarking | Code Available | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | Benchmarking, Earth Observation | Code Available | 1 |
| CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Jun 10, 2025 | Benchmarking | Code Available | 1 |
| Contemporary Symbolic Regression Methods and their Relative Performance | Jul 29, 2021 | Benchmarking, parameter estimation | Code Available | 1 |