| ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models | Jun 1, 2025 | Benchmarking, Relational Reasoning | —Unverified | 0 |
| CODEMENV: Benchmarking Large Language Models on Code Migration | Jun 1, 2025 | Benchmarking | Code Available | 1 |
| ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Jun 1, 2025 | Benchmarking, Management | Code Available | 0 |
| MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book | Jun 1, 2025 | Benchmarking | Code Available | 0 |
| The iNaturalist Sounds Dataset | May 31, 2025 | Benchmarking | —Unverified | 0 |
| AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | May 31, 2025 | Benchmarking, Test-time Adaptation | Code Available | 1 |
| Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | May 30, 2025 | Benchmarking, Code Repair | —Unverified | 0 |
| PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset | May 30, 2025 | Benchmarking, Multiple Instance Learning | Code Available | 0 |
| Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | May 30, 2025 | Benchmarking | Code Available | 0 |
| GenSpace: Benchmarking Spatially-Aware Image Generation | May 30, 2025 | Benchmarking, Image Generation | —Unverified | 0 |