| MineRL: A Large-Scale Dataset of Minecraft Demonstrations | Jul 29, 2019 | Benchmarking, Deep Reinforcement Learning | Code Available | 0 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Feb 22, 2024 | Benchmarking | Code Available | 0 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Jun 17, 2024 | Benchmarking, Dataset Generation | Code Available | 0 |
| Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling | Mar 24, 2025 | Benchmarking, OpenAI Gym | Code Available | 0 |
| Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Oct 4, 2023 | Benchmarking, Segmentation | Code Available | 0 |
| MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context Understanding | Sep 10, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| Mirage: Model-Agnostic Graph Distillation for Graph Classification | Oct 14, 2023 | Benchmarking, Classification | Code Available | 0 |
| Benchmarking Subset Selection from Large Candidate Solution Sets in Evolutionary Multi-objective Optimization | Jan 18, 2022 | Benchmarking | Code Available | 0 |
| Sanity Simulations for Saliency Methods | May 13, 2021 | Benchmarking | Code Available | 0 |
| From Variability to Stability: Advancing RecSys Benchmarking Practices | Feb 15, 2024 | Benchmarking, Collaborative Filtering | Code Available | 0 |