| PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models | May 30, 2025 | Benchmarking | Unverified | 0 |
| GenSpace: Benchmarking Spatially-Aware Image Generation | May 30, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | May 30, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| Progressive Class-level Distillation | May 30, 2025 | Benchmarking, Knowledge Distillation | Unverified | 0 |
| Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents | May 30, 2025 | Benchmarking, Code Repair | Unverified | 0 |
| Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | May 30, 2025 | Benchmarking | Code Available | 0 |
| Automated Structured Radiology Report Generation | May 30, 2025 | Benchmarking | Unverified | 0 |
| Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | May 30, 2025 | Benchmarking, Earth Observation | Unverified | 0 |
| Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | May 30, 2025 | Benchmarking, Cryptanalysis | Unverified | 0 |
| SORCE: Small Object Retrieval in Complex Environments | May 30, 2025 | Benchmarking, Image Retrieval | Code Available | 0 |