| WebWalker: Benchmarking LLMs in Web Traversal | Jan 13, 2025 | Benchmarking, Open-Domain Question Answering | Code Available | 11 | 5 |
| StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models | Mar 12, 2024 | Benchmarking | Code Available | 9 | 5 |
| EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning | Jan 25, 2025 | Benchmarking, Evolutionary Algorithms | Code Available | 7 | 5 |
| CALE: Continuous Arcade Learning Environment | Oct 31, 2024 | Atari Games, Benchmarking | Code Available | 7 | 5 |
| ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? | Jul 19, 2024 | Benchmarking, Code Generation | Code Available | 7 | 5 |
| DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | Jan 9, 2024 | Benchmarking, Text Generation | Code Available | 7 | 5 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Feb 8, 2024 | Benchmarking, Diversity | Code Available | 7 | 5 |
| NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | Jun 21, 2024 | Autonomous Driving, Benchmarking | Code Available | 7 | 5 |
| OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Apr 11, 2024 | Benchmarking | Code Available | 7 | 5 |
| Segment Anything in Medical Images and Videos: Benchmark and Deployment | Aug 6, 2024 | Benchmarking, Segmentation | Code Available | 7 | 5 |
| Better than classical? The subtle art of benchmarking quantum machine learning models | Mar 11, 2024 | Benchmarking, Binary Classification | Code Available | 7 | 5 |
| TaskBench: Benchmarking Large Language Models for Task Automation | Nov 30, 2023 | Benchmarking, Parameter Prediction | Code Available | 6 | 5 |
| AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance | Jun 4, 2025 | Benchmarking, Scheduling | Code Available | 5 | 5 |
| CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X | Mar 30, 2023 | Benchmarking, Code Generation | Code Available | 5 | 5 |
| TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods | Mar 29, 2024 | Benchmarking, Multivariate Time Series Forecasting | Code Available | 5 | 5 |
| SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation | Jan 16, 2025 | Benchmarking | Code Available | 5 | 5 |
| The BrowserGym Ecosystem for Web Agent Research | Dec 6, 2024 | Benchmarking | Code Available | 5 | 5 |
| OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations | Dec 10, 2024 | Attribute, Benchmarking | Code Available | 5 | 5 |
| Benchmarking the Myopic Trap: Positional Bias in Information Retrieval | May 20, 2025 | Benchmarking, Information Retrieval | Code Available | 5 | 5 |
| Segment Anything Model for Medical Image Segmentation: Current Applications and Future Directions | Jan 7, 2024 | Benchmarking, Image Segmentation | Code Available | 5 | 5 |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | Nov 20, 2024 | Benchmarking, Image Generation | Code Available | 5 | 5 |
| Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments | Jun 24, 2024 | Benchmarking | Code Available | 4 | 5 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Jan 31, 2024 | Benchmarking, Multiple-choice | Code Available | 4 | 5 |
| Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders | Dec 23, 2024 | 3D Shape Modeling, Benchmarking | Code Available | 4 | 5 |
| LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit | May 9, 2024 | Benchmarking, Computational Efficiency | Code Available | 4 | 5 |