| PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset | May 30, 2025 | BenchmarkingMultiple Instance Learning | CodeCode Available | 0 |
| LLM Performance for Code Generation on Noisy Tasks | May 29, 2025 | BenchmarkingCode Generation | CodeCode Available | 0 |
| R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation | May 29, 2025 | BenchmarkingImage Generation | —Unverified | 0 |
| Joint Phase Shift Optimization and Precoder Selection for RIS-Assisted 5G NR MIMO Systems | May 29, 2025 | Benchmarking | —Unverified | 0 |
| Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns | May 29, 2025 | Benchmarking | —Unverified | 0 |
| Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking | May 29, 2025 | BenchmarkingGraph Question Answering | —Unverified | 0 |
| MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge | May 29, 2025 | Benchmarking | —Unverified | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | BenchmarkingInformation Retrieval | CodeCode Available | 0 |
| Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs | May 29, 2025 | BenchmarkingFairness | CodeCode Available | 0 |
| Jailbreak Distillation: Renewable Safety Benchmarking | May 28, 2025 | BenchmarkingDiversity | —Unverified | 0 |