| Progressive Class-level Distillation | May 30, 2025 | Benchmarking, Knowledge Distillation | Unverified | 0 |
| GenSpace: Benchmarking Spatially-Aware Image Generation | May 30, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| Segmenting France Across Four Centuries | May 30, 2025 | Benchmarking, Image-to-Image Translation | Code Available | 0 |
| ByzFL: Research Framework for Robust Federated Learning | May 30, 2025 | Benchmarking, Federated Learning | Code Available | 1 |
| Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents | May 30, 2025 | Benchmarking, Blocking | Code Available | 2 |
| Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | May 30, 2025 | Benchmarking, Cryptanalysis | Unverified | 0 |
| PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models | May 30, 2025 | Benchmarking | Unverified | 0 |
| Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | May 30, 2025 | All, Benchmarking | Code Available | 1 |
| MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs | May 30, 2025 | Benchmarking | Code Available | 0 |
| Benchmarking Foundation Models for Zero-Shot Biometric Tasks | May 30, 2025 | Attribute, Benchmarking | Unverified | 0 |
| Geospatial Foundation Models to Enable Progress on Sustainable Development Goals | May 30, 2025 | Benchmarking, Earth Observation | Unverified | 0 |
| Bench4KE: Benchmarking Automated Competency Question Generation | May 30, 2025 | Benchmarking, Question Generation | Code Available | 1 |
| CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | May 30, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| Automated Structured Radiology Report Generation | May 30, 2025 | Benchmarking | Unverified | 0 |
| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | May 29, 2025 | Benchmarking, Fairness | Code Available | 0 |
| MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge | May 29, 2025 | Benchmarking | Unverified | 0 |
| Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | May 29, 2025 | Benchmarking, Graph Question Answering | Unverified | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | Benchmarking, Information Retrieval | Code Available | 0 |
| Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | May 29, 2025 | Benchmarking | Unverified | 0 |
| R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | May 29, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems | May 29, 2025 | Benchmarking | Unverified | 0 |
| Toward Memory-Aided World Models: Benchmarking via Spatial Consistency | May 29, 2025 | Benchmarking, Minecraft | Code Available | 1 |
| VERINA: Benchmarking Verifiable Code Generation | May 29, 2025 | Benchmarking, Code Generation | Code Available | 2 |
| LLM Performance for Code Generation on Noisy Tasks | May 29, 2025 | Benchmarking, Code Generation | Code Available | 0 |
| Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective | May 28, 2025 | Benchmarking, Memorization | Code Available | 0 |