| BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Apr 28, 2025 | Benchmarking | Code Available | 1 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | Benchmarking, Math | Code Available | 1 |
| LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement | Apr 22, 2025 | Benchmarking, Language Modeling | Code Available | 1 |
| LEMUR Neural Network Dataset: Towards Seamless AutoML | Apr 14, 2025 | AutoML, Benchmarking | Code Available | 1 |
| TinyverseGP: Towards a Modular Cross-domain Benchmarking Framework for Genetic Programming | Apr 14, 2025 | Benchmarking, Program Synthesis | Code Available | 1 |
| LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Apr 11, 2025 | Benchmarking, Image Generation | Code Available | 1 |
| Evolutionary Generation of Random Surreal Numbers for Benchmarking | Apr 9, 2025 | Benchmarking | Code Available | 1 |
| An Empirical Study of GPT-4o Image Generation Capabilities | Apr 8, 2025 | Benchmarking, Image Generation | Code Available | 1 |
| V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models | Apr 8, 2025 | Benchmarking, Visual Reasoning | Code Available | 1 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Apr 6, 2025 | Benchmarking, Combinatorial Optimization | Code Available | 1 |