| Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning | Jun 11, 2025 | Image Captioning, Math | Code Available | 2 |
| ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs | Jun 11, 2025 | Code Generation, Diagnostic | Code Available | 1 |
| Resa: Transparent Reasoning Models via SAEs | Jun 11, 2025 | Math | Code Available | 1 |
| TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games | Jun 11, 2025 | Logical Reasoning, Math | Unverified | 0 |
| LeanTutor: A Formally-Verified AI Tutor for Mathematical Proofs | Jun 10, 2025 | Large Language Model, Math | Unverified | 0 |
| Learning to Reason Across Parallel Samples for LLM Reasoning | Jun 10, 2025 | Math, Re-Ranking | Unverified | 0 |
| Reinforce LLM Reasoning through Multi-Agent Reflection | Jun 10, 2025 | Math, Out-of-Distribution Generalization | Unverified | 0 |
| SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning | Jun 10, 2025 | Knowledge Distillation, Math | Code Available | 1 |
| Enhancing Reasoning Capabilities of Small Language Models with Blueprints and Prompt Template Search | Jun 10, 2025 | GSM8K, Math | Unverified | 0 |
| AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions | Jun 10, 2025 | Math | Code Available | 2 |
| WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning | Jun 9, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| Play to Generalize: Learning to Reason Through Game Play | Jun 9, 2025 | Domain Generalization, Math | Code Available | 2 |
| Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation | Jun 9, 2025 | GSM8K, HumanEval | Unverified | 0 |
| The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity | Jun 7, 2025 | Math | Unverified | 0 |
| AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search | Jun 6, 2025 | Large Language Model, Math | Code Available | 0 |
| SPARQ: Synthetic Problem Generation for Reasoning via Quality-Diversity Algorithms | Jun 6, 2025 | Diversity, Large Language Model | Unverified | 0 |
| Spectral Derivatives | Jun 6, 2025 | Math | Code Available | 0 |
| Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models | Jun 5, 2025 | All, Math | Unverified | 0 |
| Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning | Jun 5, 2025 | Arithmetic Reasoning, Math | Code Available | 0 |
| Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers | Jun 5, 2025 | GSM8K, Math | Unverified | 0 |
| Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback | Jun 5, 2025 | Math | Unverified | 0 |
| TreeRPO: Tree Relative Policy Optimization | Jun 5, 2025 | Math | Code Available | 0 |
| MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning | Jun 5, 2025 | Math, Mathematical Reasoning | Code Available | 2 |
| Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning | Jun 5, 2025 | Math, Visual Grounding | Unverified | 0 |
| Guided Speculative Inference for Efficient Test-Time Alignment of LLMs | Jun 4, 2025 | Math | Code Available | 0 |