| Benchmarking Neural Speech Codec Intelligibility with SITool | Jun 2, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices | Jun 2, 2025 | Benchmarking | Unverified | 0 |
| ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists | Jun 2, 2025 | Benchmarking, Form | Unverified | 0 |
| ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Jun 1, 2025 | Benchmarking, Management | Code Available | 0 |
| MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book | Jun 1, 2025 | Benchmarking | Code Available | 0 |
| ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models | Jun 1, 2025 | Benchmarking, Relational Reasoning | Unverified | 0 |
| The iNaturalist Sounds Dataset | May 31, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Foundation Models for Zero-Shot Biometric Tasks | May 30, 2025 | Attribute, Benchmarking | Unverified | 0 |
| Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | May 30, 2025 | Benchmarking, Earth Observation | Unverified | 0 |
| GenSpace: Benchmarking Spatially-Aware Image Generation | May 30, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | May 30, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs | May 30, 2025 | Benchmarking | Code Available | 0 |
| Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | May 30, 2025 | Benchmarking, Code Repair | Unverified | 0 |
| Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | May 30, 2025 | Benchmarking | Code Available | 0 |
| SORCE: Small Object Retrieval in Complex Environments | May 30, 2025 | Benchmarking, Image Retrieval | Code Available | 0 |
| Segmenting France Across Four Centuries | May 30, 2025 | Benchmarking, Image-to-Image Translation | Code Available | 0 |
| Automated Structured Radiology Report Generation | May 30, 2025 | Benchmarking | Unverified | 0 |
| PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset | May 30, 2025 | Benchmarking, Multiple Instance Learning | Code Available | 0 |
| PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models | May 30, 2025 | Benchmarking | Unverified | 0 |
| Progressive Class-level Distillation | May 30, 2025 | Benchmarking, Knowledge Distillation | Unverified | 0 |
| Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | May 30, 2025 | Benchmarking, Cryptanalysis | Unverified | 0 |
| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | May 29, 2025 | Benchmarking, Fairness | Code Available | 0 |
| Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | May 29, 2025 | Benchmarking | Unverified | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | Benchmarking, Information Retrieval | Code Available | 0 |
| R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | May 29, 2025 | Benchmarking, Image Generation | Unverified | 0 |