| Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models | Mar 3, 2025 | Math | Unverified | 0 |
| MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts | Feb 28, 2025 | Math, Mathematical Reasoning | Unverified | 0 |
| MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training | Feb 28, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving | Feb 27, 2025 | GSM8K, Math | Code Available | 1 |
| Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning | Feb 27, 2025 | Math, Medical Question Answering | Unverified | 0 |
| Self-Training Elicits Concise Reasoning in Large Language Models | Feb 27, 2025 | GSM8K, In-Context Learning | Code Available | 1 |
| Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? | Feb 26, 2025 | Math | Code Available | 1 |
| Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation | Feb 26, 2025 | Code Generation, HumanEval | Code Available | 2 |
| Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning | Feb 25, 2025 | Math, Mathematical Reasoning | Unverified | 0 |
| SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution | Feb 25, 2025 | Math, Reinforcement Learning (RL) | Unverified | 0 |
| From Euler to AI: Unifying Formulas for Mathematical Constants | Feb 24, 2025 | Math | Code Available | 0 |
| Learning Decentralized Swarms Using Rotation Equivariant Graph Neural Networks | Feb 24, 2025 | Graph Neural Network, Math | Code Available | 0 |
| Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | Feb 24, 2025 | GSM8K, Math | Code Available | 2 |
| Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning | Feb 24, 2025 | Math, Mathematical Reasoning | Code Available | 0 |
| Reasoning with Latent Thoughts: On the Power of Looped Transformers | Feb 24, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| DISC: Dynamic Decomposition Improves LLM Inference Scaling | Feb 23, 2025 | Computational Efficiency, Math | Unverified | 0 |
| SBSC: Step-By-Step Coding for Improving Mathematical Olympiad Performance | Feb 23, 2025 | Math | Unverified | 0 |
| Inference Computation Scaling for Feature Augmentation in Recommendation Systems | Feb 22, 2025 | Math, Recommendation Systems | Unverified | 0 |
| Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning | Feb 21, 2025 | Math | Unverified | 0 |
| The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer | Feb 21, 2025 | Math, Mathematical Reasoning | Code Available | 0 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning | Feb 20, 2025 | Math, reinforcement-learning | Code Available | 7 |
| S*: Test Time Scaling for Code Generation | Feb 20, 2025 | Code Generation, Math | Code Available | 7 |
| GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks | Feb 20, 2025 | Code Generation, Math | Code Available | 0 |
| How to Get Your LLM to Generate Challenging Problems for Evaluation | Feb 20, 2025 | Code Completion, Math | Code Available | 1 |
| CER: Confidence Enhanced Reasoning in LLMs | Feb 20, 2025 | Math, Mathematical Reasoning | Code Available | 0 |
| Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective | Feb 20, 2025 | GSM8K, Math | Code Available | 0 |
| A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics | Feb 20, 2025 | Math | Unverified | 0 |
| SIFT: Grounding LLM Reasoning in Contexts via Stickers | Feb 19, 2025 | GSM8K, Math | Code Available | 2 |
| BeamLoRA: Beam-Constraint Low-Rank Adaptation | Feb 19, 2025 | Code Generation, Math | Unverified | 0 |
| DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation | Feb 19, 2025 | Diversity, Extreme Summarization | Unverified | 0 |
| The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding? | Feb 19, 2025 | Math | Unverified | 0 |
| TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation | Feb 19, 2025 | Dataset Generation, GSM8K | Code Available | 0 |
| Reasoning with Reinforced Functional Token Tuning | Feb 19, 2025 | Math | Code Available | 1 |
| Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization | Feb 18, 2025 | Math | Unverified | 0 |
| Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees | Feb 18, 2025 | Math | Unverified | 0 |
| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | Feb 18, 2025 | Math, Memorization | Unverified | 0 |
| S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning | Feb 18, 2025 | Math | Code Available | 2 |
| NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions | Feb 18, 2025 | Knowledge Distillation, Math | Unverified | 0 |
| Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation | Feb 18, 2025 | Diversity, Math | Unverified | 0 |
| Thinking Preference Optimization | Feb 17, 2025 | Math | Code Available | 1 |
| MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task | Feb 17, 2025 | Code Completion, GSM8K | Unverified | 0 |
| Scaling Test-Time Compute Without Verification or RL is Suboptimal | Feb 17, 2025 | Math, Reinforcement Learning (RL) | Unverified | 0 |
| Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving | Feb 17, 2025 | Math, Mathematical Problem-Solving | Unverified | 0 |
| Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption | Feb 17, 2025 | Benchmarking, Code Summarization | Unverified | 0 |
| Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding | Feb 17, 2025 | Arithmetic Reasoning, Chart Understanding | Unverified | 0 |
| A Study on Leveraging Search and Self-Feedback for Agent Reasoning | Feb 17, 2025 | Math | Unverified | 0 |
| Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation | Feb 17, 2025 | Knowledge Distillation, Math | Code Available | 0 |
| Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models | Feb 17, 2025 | Math | Unverified | 0 |
| Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL | Feb 17, 2025 | Code Generation, Math | Code Available | 1 |