| OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models | Oct 2, 2024 | Benchmarking | Code Available | 3 | 5 |
| GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation | Jun 19, 2024 | Benchmarking, Image Generation | Code Available | 3 | 5 |
| AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | May 22, 2025 | Benchmarking, Fairness | Code Available | 3 | 5 |
| A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs | Mar 14, 2022 | Benchmarking, Graph Embedding | Code Available | 3 | 5 |
| Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Oct 9, 2023 | Benchmarking, Multivariate Time Series Forecasting | Code Available | 3 | 5 |
| General Geospatial Inference with a Population Dynamics Foundation Model | Nov 11, 2024 | Benchmarking, Graph Neural Network | Code Available | 3 | 5 |
| A Survey on Performance Metrics for Object-Detection Algorithms | Jul 21, 2020 | Benchmarking, Object | Code Available | 3 | 5 |
| AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents | Oct 31, 2024 | Benchmarking | Code Available | 3 | 5 |
| AER: Auto-Encoder with Regression for Time Series Anomaly Detection | Dec 27, 2022 | Anomaly Detection, Benchmarking | Code Available | 3 | 5 |
| AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents | Jan 24, 2024 | Benchmarking | Code Available | 3 | 5 |