| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | Jun 5, 2025 | Benchmarking, Diversity | Unverified | 0 |
| FRED: The Florence RGB-Event Drone Dataset | Jun 5, 2025 | Benchmarking, Trajectory Forecasting | Unverified | 0 |
| Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation | Jun 5, 2025 | Benchmarking | Code Available | 0 |
| Refer to Anything with Vision-Language Prompts | Jun 5, 2025 | Benchmarking, Generalized Referring Expression Segmentation | Unverified | 0 |
| VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos | Jun 5, 2025 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories | Jun 5, 2025 | Benchmarking, Optical Character Recognition | Code Available | 2 |
| CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx | Jun 5, 2025 | 2D Pose Estimation, Benchmarking | Unverified | 0 |
| From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems | Jun 5, 2025 | Benchmarking, RAG | Unverified | 0 |
| HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | Jun 5, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values | Jun 5, 2025 | Benchmarking | Unverified | 0 |