| QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation | Jul 17, 2025 | Math, Reinforcement Learning (RL) | Unverified | 0 |
| VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks | Jul 17, 2025 | Math, Mathematical Reasoning | Unverified | 0 |
| Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training | Jul 16, 2025 | Code Generation, Math | Unverified | 0 |
| Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding | Jul 15, 2025 | Math | Unverified | 0 |
| Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing | Jul 15, 2025 | Knowledge Tracing, Math | Code Available | 0 |
| Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination | Jul 14, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning | Jul 11, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs | Jul 10, 2025 | CoLA, Large Language Model | Unverified | 0 |
| Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model | Jul 9, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| CoRE: Enhancing Metacognition with Label-free Self-evaluation in LRMs | Jul 8, 2025 | GSM8K, Math | Unverified | 0 |
| The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains | Jul 8, 2025 | Math, MMLU | Code Available | 1 |
| Activation Steering for Chain-of-Thought Compression | Jul 7, 2025 | GSM8K, Math | Code Available | 0 |
| LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models | Jul 5, 2025 | Benchmarking, GPU | Code Available | 1 |
| EvoAgentX: An Automated Framework for Evolving Agentic Workflows | Jul 4, 2025 | Code Generation, Math | Code Available | 7 |
| Effects of structure on reasoning in instance-level Self-Discover | Jul 4, 2025 | Math | Code Available | 0 |
| Energy-Based Transformers are Scalable Learners and Thinkers | Jul 2, 2025 | Denoising, Image Denoising | Code Available | 4 |
| SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning | Jun 30, 2025 | Math, Multi-agent Reinforcement Learning | Code Available | 2 |
| Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model | Jun 30, 2025 | Math | Unverified | 0 |
| Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test | Jun 26, 2025 | Code Generation, Large Language Model | Unverified | 0 |
| Bridging Offline and Online Reinforcement Learning for LLMs | Jun 26, 2025 | Instruction Following, Math | Unverified | 0 |
| Multi-lingual Functional Evaluation for Large Language Models | Jun 25, 2025 | Belebele, Instruction Following | Unverified | 0 |
| AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control | Jun 25, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling | Jun 25, 2025 | Language Modeling, Language Modelling | Code Available | 2 |
| When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs | Jun 25, 2025 | Math | Unverified | 0 |
| Causal Decomposition Analysis with Synergistic Interventions: A Triply-Robust Machine Learning Approach to Addressing Multiple Dimensions of Social Disparities | Jun 23, 2025 | Math | Unverified | 0 |