| CRANE: Reasoning with constrained LLM generation | Feb 13, 2025 | Code Generation, Math | Unverified | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | Feb 13, 2025 | Benchmarking, Math | Unverified | 0 |
| Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges | Feb 12, 2025 | GSM8K, Math | Code Available | 0 |
| Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving | Feb 12, 2025 | Math, multimodal interaction | Unverified | 0 |
| O1 Embedder: Let Retrievers Think Before Action | Feb 11, 2025 | Contrastive Learning, Math | Unverified | 0 |
| Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning | Feb 11, 2025 | Code Generation, Math | Code Available | 0 |
| MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | Feb 10, 2025 | Benchmarking, In-Context Learning | Unverified | 0 |
| Evolving LLMs' Self-Refinement Capability via Iterative Preference Optimization | Feb 8, 2025 | GSM8K, Math | Unverified | 0 |
| BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation | Feb 6, 2025 | In-Context Learning, Knowledge Distillation | Unverified | 0 |
| Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference | Feb 5, 2025 | Computational Efficiency, Language Modeling | Unverified | 0 |