| Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities | May 13, 2025 | automatic-speech-translation, Benchmarking | Unverified | 0 |
| ExEBench: Benchmarking Foundation Models on Extreme Earth Events | May 13, 2025 | Benchmarking, Management | Code Available | 0 |
| A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior | May 13, 2025 | Benchmarking, Seismic Interpretation | Unverified | 0 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | Benchmarking, Multiple-choice | Code Available | 1 |
| Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement | May 13, 2025 | Benchmarking, Language Modeling | Code Available | 2 |
| Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs | May 12, 2025 | Benchmarking, Document Layout Analysis | Unverified | 0 |
| PRISM: Complete Online Decentralized Multi-Agent Pathfinding with Rapid Information Sharing using Motion Constraints | May 12, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030 | May 12, 2025 | Benchmarking, Ethics | Unverified | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | May 12, 2025 | 16k, Benchmarking | Unverified | 0 |
| From raw affiliations to organization identifiers | May 12, 2025 | Benchmarking, Metadata quality | Code Available | 0 |