| Anchored Diffusion Language Model | May 24, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark | May 24, 2025 | Math | CodeCode Available | 0 |
| MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors | May 24, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization | May 24, 2025 | MathReinforcement Learning (RL) | —Unverified | 0 |
| VideoGameBench: Can Vision-Language Models complete popular video games? | May 23, 2025 | Math | —Unverified | 0 |
| Outcome-based Reinforcement Learning to Predict the Future | May 23, 2025 | Holdout SetMath | —Unverified | 0 |
| More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models | May 23, 2025 | DiagnosticHallucination | —Unverified | 0 |
| The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs | May 23, 2025 | Cross-Lingual TransferMath | —Unverified | 0 |
| One RL to See Them All: Visual Triple Unified Reinforcement Learning | May 23, 2025 | AllMath | —Unverified | 0 |
| AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning | May 22, 2025 | Mathreinforcement-learning | —Unverified | 0 |
| X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs | May 22, 2025 | ChatbotMath | CodeCode Available | 0 |
| EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning | May 22, 2025 | GSM8KMath | CodeCode Available | 0 |
| SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning | May 22, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 0 |
| Incremental Sequence Classification with Temporal Consistency | May 22, 2025 | ClassificationLanguage Modeling | —Unverified | 0 |
| ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models | May 22, 2025 | Large Language ModelMath | CodeCode Available | 0 |
| Veracity Bias and Beyond: Uncovering LLMs' Hidden Beliefs in Problem-Solving Reasoning | May 22, 2025 | AttributeMath | —Unverified | 0 |
| RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs | May 22, 2025 | Image ManipulationMath | —Unverified | 0 |
| Can LLMs understand Math? -- Exploring the Pitfalls in Mathematical Reasoning | May 21, 2025 | MathMathematical Reasoning | —Unverified | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | May 21, 2025 | BenchmarkingMath | —Unverified | 0 |
| SSR: Speculative Parallel Scaling Reasoning in Test-time | May 21, 2025 | DiversityMath | —Unverified | 0 |
| MIRB: Mathematical Information Retrieval Benchmark | May 21, 2025 | Automated Theorem ProvingInformation Retrieval | CodeCode Available | 0 |
| How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study | May 21, 2025 | Math | CodeCode Available | 0 |
| Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision | May 21, 2025 | GSM8KLearning-To-Rank | —Unverified | 0 |
| Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities | May 21, 2025 | MathReinforcement Learning (RL) | —Unverified | 0 |
| MAPS: A Multilingual Benchmark for Global Agent Performance and Security | May 21, 2025 | Code GenerationMath | —Unverified | 0 |
| RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning | May 20, 2025 | MathReinforcement Learning (RL) | —Unverified | 0 |
| EasyMath: A 0-shot Math Benchmark for SLMs | May 20, 2025 | Math | —Unverified | 0 |
| The Hallucination Tax of Reinforcement Finetuning | May 20, 2025 | HallucinationMath | —Unverified | 0 |
| Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning | May 20, 2025 | MathOffline RL | —Unverified | 0 |
| Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings | May 19, 2025 | HumanEvalMath | CodeCode Available | 0 |
| AutoMathKG: The automated mathematical knowledge graph based on LLM and vector database | May 19, 2025 | Data AugmentationIn-Context Learning | —Unverified | 0 |
| SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization | May 18, 2025 | MathMathematical Reasoning | —Unverified | 0 |
| MARGE: Improving Math Reasoning for LLMs with Guided Exploration | May 18, 2025 | MathMathematical Reasoning | CodeCode Available | 0 |
| MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities | May 17, 2025 | Math | —Unverified | 0 |
| LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades | May 17, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class | May 17, 2025 | MathMathematical Problem-Solving | CodeCode Available | 0 |
| SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning | May 16, 2025 | Math | —Unverified | 0 |
| HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization | May 16, 2025 | Math | CodeCode Available | 0 |
| Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation | May 16, 2025 | MathMMLU | —Unverified | 0 |
| Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models | May 15, 2025 | Code GenerationGSM8K | —Unverified | 0 |
| DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs | May 15, 2025 | BenchmarkingFairness | —Unverified | 0 |
| Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models | May 15, 2025 | Large Language ModelMath | CodeCode Available | 0 |
| PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning | May 14, 2025 | MathMathematical Problem-Solving | CodeCode Available | 0 |
| Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping | May 13, 2025 | Domain GeneralizationGSM8K | —Unverified | 0 |
| Measurement to Meaning: A Validity-Centered Framework for AI Evaluation | May 13, 2025 | Math | —Unverified | 0 |
| S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models | May 12, 2025 | GSM8KLarge Language Model | —Unverified | 0 |
| Learning from Peers in Reasoning Models | May 12, 2025 | Math | —Unverified | 0 |
| Multimodal Assessment of Classroom Discourse Quality: A Text-Centered Attention-Based Multi-Task Learning Approach | May 12, 2025 | MathMulti-Task Learning | —Unverified | 0 |
| DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs | May 11, 2025 | DiversityMath | —Unverified | 0 |
| xGen-small Technical Report | May 10, 2025 | DecoderMath | —Unverified | 0 |