| Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | May 30, 2025 | Benchmarking, Blocking | Code Available | 2 |
| VERINA: Benchmarking Verifiable Code Generation | May 29, 2025 | Benchmarking, Code Generation | Code Available | 2 |
| LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | May 27, 2025 | Bayesian Optimization, Benchmarking | Code Available | 2 |
| Benchmarking Laparoscopic Surgical Image Restoration and Beyond | May 25, 2025 | Benchmarking, Image Restoration | Code Available | 2 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | May 24, 2025 | Benchmarking | Code Available | 2 |
| GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | May 18, 2025 | Benchmarking | Code Available | 2 |
| MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | May 15, 2025 | 8k, Benchmarking | Code Available | 2 |
| Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement | May 13, 2025 | Benchmarking, Language Modeling | Code Available | 2 |
| FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | May 5, 2025 | Benchmarking, Mathematical Reasoning | Code Available | 2 |
| MINERVA: Evaluating Complex Video Reasoning | May 1, 2025 | Benchmarking, Temporal Localization | Code Available | 2 |