| Benchmarking Reasoning Robustness in Large Language Models | Mar 6, 2025 | Benchmarking, Math | —Unverified | 0 |
| SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning | Mar 6, 2025 | GSM8K, Math | —Unverified | 0 |
| HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks | Mar 6, 2025 | Chatbot, Logical Reasoning | —Unverified | 0 |
| Compositional Causal Reasoning Evaluation in Language Models | Mar 6, 2025 | Math | —Unverified | 0 |
| START: Self-taught Reasoner with Tools | Mar 6, 2025 | Math, Self-Learning | —Unverified | 0 |
| LEWIS (LayEr WIse Sparsity) -- A Training Free Guided Model Merging Approach | Mar 5, 2025 | Instruction Following, Math | —Unverified | 0 |
| Performance Comparison of Large Language Models on Advanced Calculus Problems | Mar 5, 2025 | Math, Mathematical Problem-Solving | —Unverified | 0 |
| FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4 | Mar 5, 2025 | Answer Selection, Math | —Unverified | 0 |
| Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models | Mar 4, 2025 | GSM8K, Math | —Unverified | 0 |
| What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret | Mar 3, 2025 | Math, Reinforcement Learning (RL) | —Unverified | 0 |