| HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | Jun 5, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation | Jun 5, 2025 | Benchmarking | Code Available | 0 |
| CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx | Jun 5, 2025 | 2D Pose Estimation, Benchmarking | Unverified | 0 |
| Refer to Anything with Vision-Language Prompts | Jun 5, 2025 | Benchmarking, Generalized Referring Expression Segmentation | Unverified | 0 |
| DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | Jun 5, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Benchmarking Large Language Models on Homework Assessment in Circuit Analysis | Jun 5, 2025 | Benchmarking | Unverified | 0 |
| HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models | Jun 4, 2025 | Benchmarking, General Knowledge | Code Available | 0 |
| Knowledge-guided Contextual Gene Set Analysis Using Large Language Models | Jun 4, 2025 | Benchmarking | Unverified | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | Jun 4, 2025 | Benchmarking, Language Modelling | Unverified | 0 |
| N^2: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion | Jun 4, 2025 | Benchmarking, Causal Inference | Unverified | 0 |
| Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems | Jun 4, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| CETBench: A Novel Dataset constructed via Transformations over Programs for Benchmarking LLMs for Code-Equivalence Checking | Jun 4, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | Jun 4, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence | Jun 4, 2025 | Benchmarking | Unverified | 0 |
| A Kernel-Based Approach for Accurate Steady-State Detection in Performance Time Series | Jun 4, 2025 | Benchmarking, Irregular Time Series | Code Available | 0 |
| Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset | Jun 4, 2025 | 3D geometry, Benchmarking | Unverified | 0 |
| FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models | Jun 3, 2025 | Benchmarking, Domain Adaptation | Unverified | 0 |
| SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation | Jun 3, 2025 | Benchmarking, Style Transfer | Unverified | 0 |
| Tactile MNIST: Benchmarking Active Tactile Perception | Jun 3, 2025 | Benchmarking, Scene Understanding | Unverified | 0 |
| AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering | Jun 3, 2025 | Benchmarking | Unverified | 0 |
| FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes | Jun 3, 2025 | Benchmarking, Feature Engineering | Code Available | 0 |
| CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | Jun 2, 2025 | Benchmarking | Code Available | 0 |
| ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists | Jun 2, 2025 | Benchmarking, Form | Unverified | 0 |
| FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents | Jun 2, 2025 | Benchmarking, Form | Unverified | 0 |
| ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code | Jun 2, 2025 | Benchmarking, Code Generation | Unverified | 0 |