| ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Jul 4, 2023 | Benchmarking, Weather Forecasting | Code Available | 2 | 5 |
| Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Apr 2, 2025 | Benchmarking, Synthetic Data Generation | Code Available | 2 | 5 |
| PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs | Jun 15, 2023 | Benchmarking | Code Available | 2 | 5 |
| PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning | Jun 24, 2025 | Benchmarking, Drug Discovery | Code Available | 2 | 5 |
| PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | May 15, 2024 | Benchmarking | Code Available | 2 | 5 |
| Commit0: Library Generation from Scratch | Dec 2, 2024 | Benchmarking, Code Generation | Code Available | 2 | 5 |
| ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Oct 11, 2023 | Benchmarking, Position | Code Available | 2 | 5 |
| Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Jul 4, 2024 | Benchmarking, Instruction Following | Code Available | 2 | 5 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Feb 19, 2024 | Activity Recognition, Benchmarking | Code Available | 2 | 5 |
| Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Jun 9, 2022 | Benchmarking, continuous-control | Code Available | 2 | 5 |
| COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Oct 10, 2024 | Benchmarking, Fairness | Code Available | 2 | 5 |
| Benchmarking the Robustness of LiDAR Semantic Segmentation Models | Jan 3, 2023 | Autonomous Driving, Benchmarking | Code Available | 2 | 5 |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Jul 17, 2024 | Benchmarking, Red Teaming | Code Available | 2 | 5 |
| Revealing data leakage in protein interaction benchmarks | Apr 16, 2024 | Benchmarking | Code Available | 2 | 5 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Feb 20, 2025 | Age Estimation, Benchmarking | Code Available | 2 | 5 |
| Learning to Fly -- a Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control | Mar 3, 2021 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 | 5 |
| RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning | Apr 9, 2023 | Benchmarking, Deep Reinforcement Learning | Code Available | 2 | 5 |
| REAL-Colon: A dataset for developing real-world AI applications in colonoscopy | Mar 4, 2024 | Benchmarking | Code Available | 2 | 5 |
| Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Jun 21, 2024 | Benchmarking, Text Generation | Code Available | 2 | 5 |
| BARS: Towards Open Benchmarking for Recommender Systems | May 19, 2022 | Benchmarking, Click-Through Rate Prediction | Code Available | 2 | 5 |
| Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach | Aug 31, 2019 | Articles, Benchmarking | Code Available | 2 | 5 |
| COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning | Jan 15, 2021 | Benchmarking, Misinformation | Code Available | 1 | 5 |
| Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Jan 30, 2024 | Benchmarking, image-classification | Code Available | 1 | 5 |
| RADAR: Benchmarking Language Models on Imperfect Tabular Data | Jun 9, 2025 | Benchmarking, Missing Values | Code Available | 1 | 5 |
| Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics | Jun 8, 2021 | Age And Gender Classification, Benchmarking | Code Available | 1 | 5 |