| DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Jun 22, 2022 | Benchmarking, Recommendation Systems | Code Available | 2 | 5 |
| Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions | Jan 28, 2022 | 3D Point Cloud Classification, 3D Point Cloud Data Augmentation | Code Available | 2 | 5 |
| PEDANTS: Cheap but Effective and Interpretable Answer Equivalence | Feb 17, 2024 | Benchmarking, Form | Code Available | 2 | 5 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Feb 19, 2024 | Activity Recognition, Benchmarking | Code Available | 2 | 5 |
| Benchmarking Benchmark Leakage in Large Language Models | Apr 29, 2024 | Benchmarking, Mathematical Reasoning | Code Available | 2 | 5 |
| ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Jul 4, 2023 | Benchmarking, Weather Forecasting | Code Available | 2 | 5 |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Feb 19, 2024 | Benchmarking, Interpretability Techniques for Deep Learning | Code Available | 2 | 5 |
| Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Apr 2, 2025 | Benchmarking, Synthetic Data Generation | Code Available | 2 | 5 |
| PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs | Jun 15, 2023 | Benchmarking | Code Available | 2 | 5 |
| Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Jun 9, 2022 | Benchmarking, continuous-control | Code Available | 2 | 5 |