| Alpha Excel Benchmark | May 7, 2025 | Benchmarking | —Unverified | 0 |
| Benchmarking LLMs' Swarm intelligence | May 7, 2025 | Benchmarking | Code Available | 1 |
| Call for Action: towards the next generation of symbolic regression benchmark | May 6, 2025 | Benchmarking, Diversity | —Unverified | 0 |
| Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models | May 6, 2025 | Benchmarking, Image Generation | Code Available | 0 |
| CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | May 6, 2025 | Benchmarking | Code Available | 1 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 |
| Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach | May 6, 2025 | Benchmarking, Earth Observation | Code Available | 0 |
| Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking | May 5, 2025 | Benchmarking, Prediction | —Unverified | 0 |
| FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | May 5, 2025 | Benchmarking, Mathematical Reasoning | Code Available | 2 |
| Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning | May 5, 2025 | Benchmarking | —Unverified | 0 |