| Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | May 30, 2025 | Benchmarking, Blocking | Code Available | 2 |
| VERINA: Benchmarking Verifiable Code Generation | May 29, 2025 | Benchmarking, Code Generation | Code Available | 2 |
| LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | May 27, 2025 | Bayesian Optimization, Benchmarking | Code Available | 2 |
| Benchmarking Laparoscopic Surgical Image Restoration and Beyond | May 25, 2025 | Benchmarking, Image Restoration | Code Available | 2 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | May 24, 2025 | Benchmarking | Code Available | 2 |
| GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | May 18, 2025 | Benchmarking | Code Available | 2 |
| MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | May 15, 2025 | 8k, Benchmarking | Code Available | 2 |
| Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement | May 13, 2025 | Benchmarking, Language Modeling | Code Available | 2 |
| FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | May 5, 2025 | Benchmarking, Mathematical Reasoning | Code Available | 2 |
| MINERVA: Evaluating Complex Video Reasoning | May 1, 2025 | Benchmarking, Temporal Localization | Code Available | 2 |
| Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook | May 1, 2025 | Benchmarking, Change Detection | Code Available | 2 |
| BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese | Apr 27, 2025 | Benchmarking, Proper Noun | Code Available | 2 |
| WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Apr 22, 2025 | Benchmarking | Code Available | 2 |
| Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale | Apr 19, 2025 | Benchmarking | Code Available | 2 |
| HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation | Apr 15, 2025 | Benchmarking, scientific discovery | Code Available | 2 |
| TorchFX: A modern approach to Audio DSP with PyTorch and GPU acceleration | Apr 11, 2025 | Audio Signal Processing, Benchmarking | Code Available | 2 |
| Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Apr 3, 2025 | Benchmarking, Logical Reasoning | Code Available | 2 |
| Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Apr 2, 2025 | Benchmarking, Synthetic Data Generation | Code Available | 2 |
| Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Mar 21, 2025 | Benchmarking, Video Generation | Code Available | 2 |
| VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning | Mar 19, 2025 | Benchmarking, Language Modeling | Code Available | 2 |
| MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Mar 10, 2025 | Benchmarking, Medical Question Answering | Code Available | 2 |
| Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Mar 10, 2025 | Autonomous Driving, Benchmarking | Code Available | 2 |
| Medical Hallucinations in Foundation Models and Their Impact on Healthcare | Feb 26, 2025 | Benchmarking, Hallucination | Code Available | 2 |
| Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts | Feb 24, 2025 | Benchmarking, Fact Verification | Code Available | 2 |
| TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators | Feb 20, 2025 | Benchmarking, Code Generation | Code Available | 2 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Feb 20, 2025 | Age Estimation, Benchmarking | Code Available | 2 |
| Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Feb 12, 2025 | Benchmarking, Long-Context Understanding | Code Available | 2 |
| SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning | Feb 6, 2025 | Benchmarking, Data Poisoning | Code Available | 2 |
| Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation | Feb 5, 2025 | Benchmarking, Large Language Model | Code Available | 2 |
| SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model | Jan 28, 2025 | Benchmarking, Language Modeling | Code Available | 2 |
| Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video | Jan 24, 2025 | 3D Reconstruction, Benchmarking | Code Available | 2 |
| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Jan 9, 2025 | Benchmarking, Video Understanding | Code Available | 2 |
| nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark | Jan 1, 2025 | Benchmarking, Image Segmentation | Code Available | 2 |
| An OpenMind for 3D medical vision self-supervised learning | Dec 22, 2024 | Benchmarking, Self-Supervised Learning | Code Available | 2 |
| XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation | Dec 20, 2024 | Benchmarking, Diagnostic | Code Available | 2 |
| AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Dec 19, 2024 | Autonomous Driving, Benchmarking | Code Available | 2 |
| Open Universal Arabic ASR Leaderboard | Dec 18, 2024 | Benchmarking | Code Available | 2 |
| NeuralPLexer3: Accurate Biomolecular Complex Structure Prediction with Flow Models | Dec 14, 2024 | Benchmarking, Drug Design | Code Available | 2 |
| EvalGIM: A Library for Evaluating Generative Image Models | Dec 13, 2024 | Benchmarking, Diversity | Code Available | 2 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | Benchmarking, Multimodal Reasoning | Code Available | 2 |
| Video Quality Assessment: A Comprehensive Survey | Dec 4, 2024 | Benchmarking, Survey | Code Available | 2 |
| Commit0: Library Generation from Scratch | Dec 2, 2024 | Benchmarking, Code Generation | Code Available | 2 |
| OpenQDC: Open Quantum Data Commons | Nov 29, 2024 | Benchmarking | Code Available | 2 |
| GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Nov 28, 2024 | Benchmarking, Object Counting | Code Available | 2 |
| HourVideo: 1-Hour Video-Language Understanding | Nov 7, 2024 | Benchmarking, counterfactual | Code Available | 2 |
| Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping | Nov 5, 2024 | Benchmarking, Code Generation | Code Available | 2 |
| LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators | Oct 31, 2024 | Benchmarking, Text Generation | Code Available | 2 |
| InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Oct 30, 2024 | Benchmarking | Code Available | 2 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Oct 30, 2024 | Benchmarking, Passage Retrieval | Code Available | 2 |
| PC-Gym: Benchmark Environments For Process Control Problems | Oct 29, 2024 | Benchmarking, Chemical Process | Code Available | 2 |