| Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Feb 12, 2024 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 2 | 5 |
| A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends | Sep 29, 2024 | Benchmarking, graph construction | Code Available | 2 | 5 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | May 24, 2025 | Benchmarking | Code Available | 2 | 5 |
| DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Jun 22, 2022 | Benchmarking, Recommendation Systems | Code Available | 2 | 5 |
| Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Jul 4, 2024 | Benchmarking, Minecraft | Code Available | 2 | 5 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | Benchmarking, Multimodal Reasoning | Code Available | 2 | 5 |
| Datasets and Benchmarks for Offline Safe Reinforcement Learning | Jun 15, 2023 | Autonomous Driving, Benchmarking | Code Available | 2 | 5 |
| EvalGIM: A Library for Evaluating Generative Image Models | Dec 13, 2024 | Benchmarking, Diversity | Code Available | 2 | 5 |
| Commit0: Library Generation from Scratch | Dec 2, 2024 | Benchmarking, Code Generation | Code Available | 2 | 5 |
| CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Jul 3, 2024 | Benchmarking, Code Search | Code Available | 2 | 5 |
| COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Oct 10, 2024 | Benchmarking, Fairness | Code Available | 2 | 5 |
| COALA: A Practical and Vision-Centric Federated Learning Platform | Jul 23, 2024 | Benchmarking, Continual Learning | Code Available | 2 | 5 |
| Benchmarking Potential Based Rewards for Learning Humanoid Locomotion | Jul 19, 2023 | Benchmarking, Reinforcement Learning (RL) | Code Available | 2 | 5 |
| Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Jul 4, 2024 | Benchmarking, Instruction Following | Code Available | 2 | 5 |
| CoqPilot, a plugin for LLM-based generation of proofs | Oct 25, 2024 | Benchmarking | Code Available | 2 | 5 |
| Benchmarking Benchmark Leakage in Large Language Models | Apr 29, 2024 | Benchmarking, Mathematical Reasoning | Code Available | 2 | 5 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Feb 19, 2024 | Activity Recognition, Benchmarking | Code Available | 2 | 5 |
| OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception | Mar 7, 2023 | Autonomous Driving, Benchmarking | Code Available | 2 | 5 |
| ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Jul 4, 2023 | Benchmarking, Weather Forecasting | Code Available | 2 | 5 |
| A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Sep 26, 2023 | Benchmarking, Multi-Objective Reinforcement Learning | Code Available | 2 | 5 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Oct 30, 2024 | Benchmarking, Passage Retrieval | Code Available | 2 | 5 |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Feb 19, 2024 | Benchmarking, Interpretability Techniques for Deep Learning | Code Available | 2 | 5 |
| Benchmarking and Improving Detail Image Caption | May 29, 2024 | Benchmarking, Image Captioning | Code Available | 2 | 5 |
| Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Jun 9, 2022 | Benchmarking, continuous-control | Code Available | 2 | 5 |
| Building Normalizing Flows with Stochastic Interpolants | Sep 30, 2022 | Benchmarking, Density Estimation | Code Available | 2 | 5 |