| OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification | Apr 29, 2025 | Benchmarking, Code Generation | Code Available | 1 |
| TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks | Apr 29, 2025 | Benchmarking, Misinformation | Code Available | 1 |
| On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks | Apr 29, 2025 | Anomaly Detection, Benchmarking | Unverified | 0 |
| LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs | Apr 29, 2025 | Benchmarking, Face Generation | Unverified | 0 |
| SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories | Apr 29, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Bridging the Generalisation Gap: Synthetic Data Generation for Multi-Site Clinical Model Validation | Apr 29, 2025 | Benchmarking, Fairness | Code Available | 0 |
| TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models | Apr 29, 2025 | Benchmarking, Dataset Generation | Code Available | 0 |
| Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | Apr 28, 2025 | Articles, Benchmarking | Unverified | 0 |
| BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics | Apr 28, 2025 | Benchmarking | Unverified | 0 |
| WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution | Apr 28, 2025 | Benchmarking, Image Attribution | Unverified | 0 |