| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Jan 31, 2024 | Benchmarking, Multiple-choice | Code Available | 4 |
| Molecular-driven Foundation Model for Oncologic Pathology | Jan 28, 2025 | Benchmarking, Diagnostic | Code Available | 4 |
| shapiq: Shapley Interactions for Machine Learning | Oct 2, 2024 | Benchmarking, Data Valuation | Code Available | 4 |
| Benchmarking Automatic Machine Learning Frameworks | Aug 17, 2018 | Automated Feature Engineering, AutoML | Code Available | 3 |
| Advancing LLM Reasoning Generalists with Preference Trees | Apr 2, 2024 | Benchmarking, Code Generation | Code Available | 3 |
| Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Oct 9, 2024 | Benchmarking, Decision Making | Code Available | 3 |
| Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving | May 27, 2024 | Autonomous Driving, Benchmarking | Code Available | 3 |
| DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks | Jun 13, 2024 | Benchmarking | Code Available | 3 |
| DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents | Jun 10, 2024 | Benchmarking, scientific discovery | Code Available | 3 |
| CORL: Research-oriented Deep Offline Reinforcement Learning Library | Oct 13, 2022 | Benchmarking, D4RL | Code Available | 3 |