| EvoAgentX: An Automated Framework for Evolving Agentic Workflows | Jul 4, 2025 | Code Generation, Math | Code Available | 7 |
| LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning | Jun 16, 2025 | Code Generation, Mathematical Problem-Solving | Code Available | 0 |
| TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving | Jun 12, 2025 | Logical Reasoning, Mathematical Problem-Solving | Unverified | 0 |
| SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning | Jun 10, 2025 | Knowledge Distillation, Math | Code Available | 1 |
| Solving Inequality Proofs with Large Language Models | Jun 9, 2025 | Mathematical Problem-Solving, Relation Prediction | Code Available | 1 |
| Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation | Jun 8, 2025 | Code Generation, Mathematical Problem-Solving | Code Available | 0 |
| MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Jun 5, 2025 | Dataset Generation, Mathematical Problem-Solving | Code Available | 1 |
| PoLAR: Polar-Decomposed Low-Rank Adapter Representation | Jun 3, 2025 | Mathematical Problem-Solving, Riemannian optimization | Unverified | 0 |
| Evaluation of LLMs for mathematical problem solving | May 30, 2025 | GSM8K, Mathematical Problem-Solving | Unverified | 0 |
| Decomposing Elements of Problem Solving: What "Math" Does RL Teach? | May 28, 2025 | Math, Mathematical Problem-Solving | Code Available | 0 |
| Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision | May 26, 2025 | Hallucination, Math | Code Available | 0 |
| Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | May 26, 2025 | Logical Reasoning, Mathematical Problem-Solving | Code Available | 0 |
| RaDeR: Reasoning-aware Dense Retrieval Models | May 23, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu | May 22, 2025 | Mathematical Problem-Solving | Unverified | 0 |
| SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving | May 22, 2025 | Diagnostic, Mathematical Problem-Solving | Unverified | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | May 21, 2025 | Benchmarking, Math | Unverified | 0 |
| HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class | May 17, 2025 | Math, Mathematical Problem-Solving | Code Available | 0 |
| Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations | May 16, 2025 | Code Generation, Mathematical Problem-Solving | Unverified | 0 |
| Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs | May 16, 2025 | Mathematical Problem-Solving, Reinforcement Learning (RL) | Unverified | 0 |
| PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning | May 14, 2025 | Math, Mathematical Problem-Solving | Code Available | 0 |
| Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving | May 12, 2025 | Math, Mathematical Problem-Solving | Code Available | 2 |
| Reasoning Models Can Be Effective Without Thinking | Apr 14, 2025 | Automated Theorem Proving, Mathematical Problem-Solving | Unverified | 0 |
| Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models | Apr 9, 2025 | Instruction Following, Mathematical Problem-Solving | Unverified | 0 |
| On Vanishing Variance in Transformer Length Generalization | Apr 3, 2025 | Attribute, Mathematical Problem-Solving | Unverified | 0 |
| LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models | Apr 3, 2025 | Mathematical Problem-Solving, Prompt Engineering | Unverified | 0 |