| MMATH: A Multilingual Benchmark for Mathematical Reasoning | May 25, 2025 | Math, Mathematical Reasoning | Code Available | 0 |
| Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions | May 24, 2025 | Automated Theorem Proving, Math | Code Available | 0 |
| Efficient Long CoT Reasoning in Small Language Models | May 24, 2025 | Mathematical Reasoning, valid | Unverified | 0 |
| LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges | May 24, 2025 | Benchmarking, Mathematical Reasoning | Code Available | 0 |
| Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation | May 24, 2025 | Mathematical Reasoning, Multimodal Reasoning | Unverified | 0 |
| Unraveling Misinformation Propagation in LLM Reasoning | May 24, 2025 | Mathematical Reasoning, Misinformation | Code Available | 0 |
| PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models | May 23, 2025 | Code Generation, Mathematical Reasoning | Unverified | 0 |
| Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence | May 23, 2025 | GPU, Large Language Model | Unverified | 0 |
| The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs | May 23, 2025 | Cross-Lingual Transfer, Math | Unverified | 0 |
| MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models | May 22, 2025 | Mathematical Reasoning | Unverified | 0 |