| POPGym: Benchmarking Partially Observable Reinforcement Learning | Mar 3, 2023 | Benchmarking, GPU | Code Available | 2 |
| Fortuna: A Library for Uncertainty Quantification in Deep Learning | Feb 8, 2023 | Bayesian Inference, Benchmarking | Code Available | 2 |
| Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Jan 14, 2023 | Benchmarking | Code Available | 2 |
| Benchmarking the Robustness of LiDAR Semantic Segmentation Models | Jan 3, 2023 | Autonomous Driving, Benchmarking | Code Available | 2 |
| Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method | Dec 22, 2022 | 4k, 8k | Code Available | 2 |
| PyPop7: A Pure-Python Library for Population-Based Black-Box Optimization | Dec 12, 2022 | Benchmarking, Evolutionary Algorithms | Code Available | 2 |
| Why do tree-based models still outperform deep learning on typical tabular data? | Nov 28, 2022 | Benchmarking | Code Available | 2 |
| Immersive Neural Graphics Primitives | Nov 24, 2022 | Benchmarking, NeRF | Code Available | 2 |
| LaMAR: Benchmarking Localization and Mapping for Augmented Reality | Oct 19, 2022 | Benchmarking, Diversity | Code Available | 2 |
| rPPG-Toolbox: Deep Remote PPG Toolbox | Oct 3, 2022 | Benchmarking, Data Augmentation | Code Available | 2 |
| Building Normalizing Flows with Stochastic Interpolants | Sep 30, 2022 | Benchmarking, Density Estimation | Code Available | 2 |
| State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Sep 30, 2022 | Benchmarking, Blind Docking | Code Available | 2 |
| MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Aug 17, 2022 | Benchmarking, Code Generation | Code Available | 2 |
| Panoptic Scene Graph Generation | Jul 22, 2022 | Benchmarking, Panoptic Scene Graph Generation | Code Available | 2 |
| Why do tree-based models still outperform deep learning on tabular data? | Jul 18, 2022 | Benchmarking | Code Available | 2 |
| VMAS: A Vectorized Multi-Agent Simulator for Collective Robot Learning | Jul 7, 2022 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 |
| Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding | Jul 4, 2022 | Benchmarking, Document Ranking | Code Available | 2 |
| The ArtBench Dataset: Benchmarking Generative Models with Artworks | Jun 22, 2022 | Benchmarking, Conditional Image Generation | Code Available | 2 |
| DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Jun 22, 2022 | Benchmarking, Recommendation Systems | Code Available | 2 |
| Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Jun 9, 2022 | Benchmarking, continuous-control | Code Available | 2 |
| Fast Vision Transformers with HiLo Attention | May 26, 2022 | Benchmarking, Efficient ViTs | Code Available | 2 |
| BARS: Towards Open Benchmarking for Recommender Systems | May 19, 2022 | Benchmarking, Click-Through Rate Prediction | Code Available | 2 |
| K-LITE: Learning Transferable Visual Models with External Knowledge | Apr 20, 2022 | Benchmarking, Descriptive | Code Available | 2 |
| Deep Visual Geo-localization Benchmark | Apr 7, 2022 | Benchmarking, Data Augmentation | Code Available | 2 |
| Multi-Class Road User Detection With 3+1D Radar in the View-of-Delft Dataset | Apr 1, 2022 | 3D Object Detection, Benchmarking | Code Available | 2 |
| ADATIME: A Benchmarking Suite for Domain Adaptation on Time Series Data | Mar 15, 2022 | Benchmarking, Domain Adaptation | Code Available | 2 |
| Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions | Jan 28, 2022 | 3D Point Cloud Classification, 3D Point Cloud Data Augmentation | Code Available | 2 |
| AiTLAS: Artificial Intelligence Toolbox for Earth Observation | Jan 21, 2022 | Benchmarking, Earth Observation | Code Available | 2 |
| Investigating Tradeoffs in Real-World Video Super-Resolution | Nov 24, 2021 | Benchmarking, Super-Resolution | Code Available | 2 |
| Multitask Prompted Training Enables Zero-Shot Task Generalization | Oct 15, 2021 | Benchmarking, Decoder | Code Available | 2 |
| MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning | Sep 26, 2021 | Benchmarking, Decision Making | Code Available | 2 |
| Panoptic nuScenes: A Large-Scale Benchmark for LiDAR Panoptic Segmentation and Tracking | Sep 8, 2021 | Benchmarking, Diversity | Code Available | 2 |
| BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models | Apr 17, 2021 | Argument Retrieval, Benchmarking | Code Available | 2 |
| Learning to Fly -- a Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control | Mar 3, 2021 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 |
| Learning Transferable Visual Models From Natural Language Supervision | Feb 26, 2021 | Action Recognition, Benchmarking | Code Available | 2 |
| Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details | Feb 1, 2021 | Benchmarking, object-detection | Code Available | 2 |
| PyHealth: A Python Library for Health Predictive Models | Jan 11, 2021 | Benchmarking, BIG-bench Machine Learning | Code Available | 2 |
| TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks | Sep 16, 2020 | Anomaly Detection, Benchmarking | Code Available | 2 |
| Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples | Sep 9, 2020 | Adversarial Text, Benchmarking | Code Available | 2 |
| Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework | Jun 23, 2020 | Benchmarking, GPU | Code Available | 2 |
| Benchmarking Graph Neural Networks | Mar 2, 2020 | Benchmarking, Graph Classification | Code Available | 2 |
| Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach | Aug 31, 2019 | Articles, Benchmarking | Code Available | 2 |
| Habitat: A Platform for Embodied AI Research | Apr 2, 2019 | Benchmarking, GPU | Code Available | 2 |
| Benchmarking Neural Network Robustness to Common Corruptions and Perturbations | Mar 28, 2019 | Adversarial Defense, Benchmarking | Code Available | 2 |
| A large annotated medical image dataset for the development and evaluation of segmentation algorithms | Feb 25, 2019 | Benchmarking, Segmentation | Code Available | 2 |
| Benchmarking Deep Reinforcement Learning for Continuous Control | Apr 22, 2016 | Action Triplet Recognition, Atari Games | Code Available | 2 |
| LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models | Jul 5, 2025 | Benchmarking, GPU | Code Available | 1 |
| Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data | Jul 3, 2025 | Benchmarking, Representation Learning | Code Available | 1 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Jun 26, 2025 | Benchmarking, Drug Design | Code Available | 1 |
| WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads | Jun 25, 2025 | Benchmarking | Code Available | 1 |