| Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | May 19, 2025 | Benchmarking, Chatbot | Code Available | 1 |
| PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI | May 19, 2025 | Benchmarking, Minecraft | Unverified | 0 |
| Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning | May 19, 2025 | Benchmarking | Unverified | 0 |
| A Comprehensive Benchmarking Platform for Deep Generative Models in Molecular Design | May 19, 2025 | Benchmarking, Drug Discovery | Unverified | 0 |
| CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | May 19, 2025 | Benchmarking, Red Teaming | Unverified | 0 |
| Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | May 19, 2025 | Automated Theorem Proving, Benchmarking | Code Available | 1 |
| LEXam: Benchmarking Legal Reasoning on 340 Law Exams | May 19, 2025 | Benchmarking, Legal Reasoning | Unverified | 0 |
| Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings | May 19, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents | May 19, 2025 | AI Agent, Benchmarking | Code Available | 1 |
| Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | May 19, 2025 | Benchmarking | Code Available | 0 |