| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | Jun 5, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Knowledge-guided Contextual Gene Set Analysis Using Large Language Models | Jun 4, 2025 | Benchmarking | Unverified | 0 |
| Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset | Jun 4, 2025 | 3D geometry, Benchmarking | Unverified | 0 |
| MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | Jun 4, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence | Jun 4, 2025 | Benchmarking | Unverified | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | Jun 4, 2025 | Benchmarking, Language Modelling | Unverified | 0 |
| HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models | Jun 4, 2025 | Benchmarking, General Knowledge | Code Available | 0 |
| A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series | Jun 4, 2025 | Benchmarking, Irregular Time Series | Code Available | 0 |
| Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems | Jun 4, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking | Jun 4, 2025 | Benchmarking, Code Generation | Unverified | 0 |