| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Feb 8, 2025 | Benchmarking, Self-Supervised Learning | Code Available | 1 |
| ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks | Feb 7, 2025 | Benchmarking | Code Available | 3 |
| An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Feb 7, 2025 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | Feb 7, 2025 | Benchmarking | Code Available | 4 |
| EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models | Feb 6, 2025 | Benchmarking, Emotional Intelligence | Unverified | 0 |
| Verifiable Format Control for Large Language Model Generations | Feb 6, 2025 | Benchmarking, Instruction Following | Unverified | 0 |
| Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs | Feb 6, 2025 | Benchmarking, Epidemiology | Code Available | 0 |
| Large Language Models for Multi-Robot Systems: A Survey | Feb 6, 2025 | Action Generation, Benchmarking | Code Available | 1 |
| LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset | Feb 6, 2025 | Benchmarking, Computed Tomography (CT) | Unverified | 0 |
| Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization | Feb 6, 2025 | Benchmarking, Uncertainty Quantification | Unverified | 0 |