| CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Aug 2, 2022 | Benchmarking, Causal Discovery | Code Available | 1 |
| A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Aug 25, 2021 | Benchmarking, Video Classification | Code Available | 1 |
| CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Jan 21, 2024 | Benchmarking | Code Available | 1 |
| On the Detectability of ChatGPT Content: Benchmarking, Methodology, and Evaluation through the Lens of Academic Writing | Jun 7, 2023 | Benchmarking, Prompt Engineering | Code Available | 1 |
| CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Jul 13, 2020 | Benchmarking | Code Available | 1 |
| Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? | Jun 15, 2023 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| Active-Passive SimStereo -- Benchmarking the Cross-Generalization Capabilities of Deep Learning-based Stereo Methods | Sep 17, 2022 | Benchmarking, Stereo Matching | Code Available | 1 |
| Chaos as an interpretable benchmark for forecasting and data-driven modelling | Oct 11, 2021 | Benchmarking, Symbolic Regression | Code Available | 1 |
| MatTools: Benchmarking Large Language Models for Materials Science Tools | May 16, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| CharacterBench: Benchmarking Character Customization of Large Language Models | Dec 16, 2024 | Benchmarking | Code Available | 1 |