| DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Oct 31, 2024 | Benchmarking, LLM-generated Text Detection | Code Available | 1 |
| OpenDataVal: a Unified Benchmark for Data Valuation | Jun 18, 2023 | Benchmarking, Data Valuation | Code Available | 1 |
| Dataset and Benchmark: Novel Sensors for Autonomous Vehicle Perception | Jan 24, 2024 | Benchmarking | Code Available | 1 |
| Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets | Apr 11, 2022 | Action Triplet Recognition, Benchmarking | Code Available | 1 |
| Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Apr 16, 2021 | Benchmarking, Causal Discovery | Code Available | 1 |
| BiBench: Benchmarking and Analyzing Network Binarization | Jan 26, 2023 | Benchmarking, Binarization | Code Available | 1 |
| BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Nov 21, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Oct 30, 2024 | Benchmarking, Management | Code Available | 1 |
| DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation | Oct 11, 2022 | 6D Pose Estimation, 6D Pose Estimation using RGB | Code Available | 1 |
| DACBench: A Benchmark Library for Dynamic Algorithm Configuration | May 18, 2021 | Benchmarking | Code Available | 1 |
| AQuA: A Benchmarking Tool for Label Quality Assessment | Jun 15, 2023 | Benchmarking, Label Error Detection | Code Available | 1 |
| FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models | Jan 1, 2024 | Benchmarking | Code Available | 1 |
| APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Dec 25, 2023 | Animal Pose Estimation, Benchmarking | Code Available | 1 |
| D2S: Document-to-Slide Generation Via Query-Based Text Summarization | May 8, 2021 | Benchmarking, Long Form Question Answering | Code Available | 1 |
| Optimizing Performance of Federated Person Re-identification: Benchmarking and Analysis | May 24, 2022 | Benchmarking, Federated Learning | Code Available | 1 |
| OPV2V: An Open Benchmark Dataset and Fusion Pipeline for Perception with Vehicle-to-Vehicle Communication | Sep 16, 2021 | 3D Object Detection, Benchmarking | Code Available | 1 |
| OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification | Apr 29, 2025 | Benchmarking, Code Generation | Code Available | 1 |
| Data-Driven Denoising of Stationary Accelerometer Signals | Jun 13, 2022 | Benchmarking, Denoising | Code Available | 1 |
| Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | May 19, 2025 | Benchmarking, Chatbot | Code Available | 1 |
| Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT | Jun 13, 2024 | Benchmarking, LLM-generated Text Detection | Code Available | 1 |
| CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Dec 2, 2021 | Benchmarking, Ordinal Classification | Code Available | 1 |
| Curious Hierarchical Actor-Critic Reinforcement Learning | May 7, 2020 | Benchmarking, Hierarchical Reinforcement Learning | Code Available | 1 |
| CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Oct 23, 2023 | Benchmarking | Code Available | 1 |
| COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Mar 15, 2019 | Benchmarking | Code Available | 1 |
| CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Nov 19, 2022 | Benchmarking, C++ code | Code Available | 1 |
| Benchmarking Graph Neural Networks on Dynamic Link Prediction | Sep 29, 2021 | Benchmarking, Dynamic Link Prediction | Code Available | 1 |
| Benchmarking Graph Neural Networks for FMRI analysis | Nov 16, 2022 | Benchmarking | Code Available | 1 |
| Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Jun 22, 2023 | Arithmetic Reasoning, Benchmarking | Code Available | 1 |
| BiCo-Net: Regress Globally, Match Locally for Robust 6D Pose Estimation | May 7, 2022 | 6D Pose Estimation, Benchmarking | Code Available | 1 |
| ClearPose: Large-scale Transparent Object Dataset and Benchmark | Mar 8, 2022 | Benchmarking, Depth Completion | Code Available | 1 |
| BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Apr 28, 2025 | Benchmarking | Code Available | 1 |
| Performance Evaluation of Deep Transfer Learning on Multiclass Identification of Common Weed Species in Cotton Production Systems | Oct 11, 2021 | Benchmarking, Management | Code Available | 1 |
| PGDQN: Preference-Guided Deep Q-Network | Oct 3, 2023 | Atari Games, Benchmarking | Code Available | 1 |
| Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Oct 11, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| Beyond neural scaling laws: beating power law scaling via data pruning | Jun 29, 2022 | Benchmarking | Code Available | 1 |
| Beyond Normal: On the Evaluation of Mutual Information Estimators | Jun 19, 2023 | Benchmarking, Domain Generalization | Code Available | 1 |
| CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Jan 2, 2025 | Benchmarking, Computer Security | Code Available | 1 |
| dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing | Apr 27, 2021 | Benchmarking, Retrieval | Code Available | 1 |
| PLANTAIN: Diffusion-inspired Pose Score Minimization for Fast and Accurate Molecular Docking | Jul 22, 2023 | Benchmarking, Molecular Docking | Code Available | 1 |
| Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering | Aug 31, 2023 | Benchmarking, Dataset Generation | Code Available | 1 |
| ECRECer: Enzyme Commission Number Recommendation and Benchmarking based on Multiagent Dual-core Learning | Feb 8, 2022 | Benchmarking, Language Modelling | Code Available | 1 |
| Kvasir-Instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy | Oct 23, 2020 | Benchmarking, Diagnostic | Code Available | 1 |
| RADIATE: A Radar Dataset for Automotive Perception in Bad Weather | Oct 18, 2020 | Autonomous Driving, Benchmarking | Code Available | 1 |
| POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding | Jul 20, 2024 | Benchmarking, Heuristic Search | Code Available | 1 |
| CLoG: Benchmarking Continual Learning of Image Generation Models | Jun 7, 2024 | Benchmarking, Continual Learning | Code Available | 1 |
| Positional Encoding in Transformer-Based Time Series Models: A Survey | Feb 17, 2025 | Anomaly Detection, Benchmarking | Code Available | 1 |
| PowerMamba: A Deep State Space Model and Comprehensive Benchmark for Time Series Prediction in Electric Power Systems | Dec 9, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Benchmarking Graph Learning for Drug-Drug Interaction Prediction | Oct 24, 2024 | Benchmarking, Graph Learning | Unverified | 0 |
| A practical generalization metric for deep networks benchmarking | Sep 2, 2024 | Benchmarking, Diversity | Unverified | 0 |
| AERF: Adaptive ensemble random fuzzy algorithm for anomaly detection in cloud computing | Jan 9, 2023 | Anomaly Detection, Benchmarking | Unverified | 0 |