| OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Apr 11, 2024 | Benchmarking | Code Available | 7 | 5 |
| TaskBench: Benchmarking Large Language Models for Task Automation | Nov 30, 2023 | Benchmarking, Parameter Prediction | Code Available | 6 | 5 |
| CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | Mar 30, 2023 | Benchmarking, Code Generation | Code Available | 5 | 5 |
| AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance | Jun 4, 2025 | Benchmarking, Scheduling | Code Available | 5 | 5 |
| TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods | Mar 29, 2024 | Benchmarking, Multivariate Time Series Forecasting | Code Available | 5 | 5 |
| The BrowserGym Ecosystem for Web Agent Research | Dec 6, 2024 | Benchmarking | Code Available | 5 | 5 |
| SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation | Jan 16, 2025 | Benchmarking | Code Available | 5 | 5 |
| Segment Anything Model for Medical Image Segmentation: Current Applications and Future Directions | Jan 7, 2024 | Benchmarking, Image Segmentation | Code Available | 5 | 5 |
| OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations | Dec 10, 2024 | Attribute, Benchmarking | Code Available | 5 | 5 |
| Benchmarking the Myopic Trap: Positional Bias in Information Retrieval | May 20, 2025 | Benchmarking, Information Retrieval | Code Available | 5 | 5 |