| TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs | Mar 13, 2025 | Benchmarking, Question Answering | —Unverified | 0 |
| DarkBench: Benchmarking Dark Patterns in Large Language Models | Mar 13, 2025 | Benchmarking | —Unverified | 0 |
| ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content | Mar 13, 2025 | Benchmarking, Image Generation | —Unverified | 0 |
| SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models | Mar 12, 2025 | Benchmarking, Fairness | —Unverified | 0 |
| CULEMO: Cultural Lenses on Emotion -- Benchmarking LLMs for Cross-Cultural Emotion Understanding | Mar 12, 2025 | Benchmarking, Emotion Recognition | —Unverified | 0 |
| MarineGym: A High-Performance Reinforcement Learning Platform for Underwater Robotics | Mar 12, 2025 | Benchmarking, GPU | —Unverified | 0 |
| CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Mar 12, 2025 | Benchmarking, Code Classification | Code Available | 1 |
| Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges | Mar 11, 2025 | Benchmarking | Code Available | 0 |
| Robust Latent Matters: Boosting Image Generation with Sampling Error | Mar 11, 2025 | Benchmarking, Image Generation | Code Available | 3 |
| Comprehensive Benchmarking of Machine Learning Methods for Risk Prediction Modelling from Large-Scale Survival Data: A UK Biobank Study | Mar 11, 2025 | Benchmarking | —Unverified | 0 |