| Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | May 30, 2025 | BenchmarkingBlocking | CodeCode Available | 2 |
| VERINA: Benchmarking Verifiable Code Generation | May 29, 2025 | BenchmarkingCode Generation | CodeCode Available | 2 |
| LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | May 27, 2025 | Bayesian OptimizationBenchmarking | CodeCode Available | 2 |
| Benchmarking Laparoscopic Surgical Image Restoration and Beyond | May 25, 2025 | BenchmarkingImage Restoration | CodeCode Available | 2 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | May 24, 2025 | Benchmarking | CodeCode Available | 2 |
| GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | May 18, 2025 | Benchmarking | CodeCode Available | 2 |
| MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | May 15, 2025 | 8kBenchmarking | CodeCode Available | 2 |
| Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement | May 13, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 2 |
| FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | May 5, 2025 | BenchmarkingMathematical Reasoning | CodeCode Available | 2 |
| MINERVA: Evaluating Complex Video Reasoning | May 1, 2025 | BenchmarkingTemporal Localization | CodeCode Available | 2 |
| Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook | May 1, 2025 | BenchmarkingChange Detection | CodeCode Available | 2 |
| BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | Apr 27, 2025 | BenchmarkingProper Noun | CodeCode Available | 2 |
| WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Apr 22, 2025 | Benchmarking | CodeCode Available | 2 |
| Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale | Apr 19, 2025 | Benchmarking | CodeCode Available | 2 |
| HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation | Apr 15, 2025 | Benchmarkingscientific discovery | CodeCode Available | 2 |
| TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration | Apr 11, 2025 | Audio Signal ProcessingBenchmarking | CodeCode Available | 2 |
| Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Apr 3, 2025 | BenchmarkingLogical Reasoning | CodeCode Available | 2 |
| Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Apr 2, 2025 | BenchmarkingSynthetic Data Generation | CodeCode Available | 2 |
| Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Mar 21, 2025 | BenchmarkingVideo Generation | CodeCode Available | 2 |
| VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning | Mar 19, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 2 |
| MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Mar 10, 2025 | BenchmarkingMedical Question Answering | CodeCode Available | 2 |
| Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Mar 10, 2025 | Autonomous DrivingBenchmarking | CodeCode Available | 2 |
| Medical Hallucinations in Foundation Models and Their Impact on Healthcare | Feb 26, 2025 | BenchmarkingHallucination | CodeCode Available | 2 |
| Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Feb 24, 2025 | BenchmarkingFact Verification | CodeCode Available | 2 |
| TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Feb 20, 2025 | BenchmarkingCode Generation | CodeCode Available | 2 |