| Seeing in the Dark: Benchmarking Egocentric 3D Vision with the Oxford Day-and-Night Dataset | Jun 4, 2025 | 3D geometry, Benchmarking | Unverified | 0 |
| FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models | Jun 3, 2025 | Benchmarking, Domain Adaptation | Unverified | 0 |
| SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation | Jun 3, 2025 | Benchmarking, Style Transfer | Unverified | 0 |
| Tactile MNIST: Benchmarking Active Tactile Perception | Jun 3, 2025 | Benchmarking, Scene Understanding | Unverified | 0 |
| AMLgentex: Mobilizing Data-Driven Research to Combat Money Laundering | Jun 3, 2025 | Benchmarking | Unverified | 0 |
| FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes | Jun 3, 2025 | Benchmarking, Feature Engineering | Code Available | 0 |
| CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models | Jun 2, 2025 | Benchmarking | Code Available | 0 |
| ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists | Jun 2, 2025 | Benchmarking, Form | Unverified | 0 |
| FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents | Jun 2, 2025 | Benchmarking, Form | Unverified | 0 |
| ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code | Jun 2, 2025 | Benchmarking, Code Generation | Unverified | 0 |