| RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs | May 22, 2025 | Image Manipulation, Math | Unverified | 0 |
| EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning | May 22, 2025 | GSM8K, Math | Code Available | 0 |
| Veracity Bias and Beyond: Uncovering LLMs' Hidden Beliefs in Problem-Solving Reasoning | May 22, 2025 | Attribute, Math | Unverified | 0 |
| X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs | May 22, 2025 | Chatbot, Math | Code Available | 0 |
| Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs | May 22, 2025 | Diagnostic, Machine Unlearning | Code Available | 1 |
| Training Step-Level Reasoning Verifiers with Formal Verification Tools | May 21, 2025 | Formal Logic, Math | Code Available | 1 |
| How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study | May 21, 2025 | Math | Code Available | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | May 21, 2025 | Benchmarking, Math | Unverified | 0 |
| MAPS: A Multilingual Benchmark for Global Agent Performance and Security | May 21, 2025 | Code Generation, Math | Unverified | 0 |
| SSR: Speculative Parallel Scaling Reasoning in Test-time | May 21, 2025 | Diversity, Math | Unverified | 0 |