| Thinking Preference Optimization | Feb 17, 2025 | Math | Code Available | 1 |
| Dyve: Thinking Fast and Slow for Dynamic Process Verification | Feb 16, 2025 | Math | Code Available | 1 |
| Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping | Feb 16, 2025 | Code Generation, Instruction Following | Code Available | 1 |
| Do Large Language Model Benchmarks Test Reliability? | Feb 5, 2025 | Language Modeling | Code Available | 1 |
| A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Feb 3, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis | Jan 30, 2025 | Automated Theorem Proving, Math | Code Available | 1 |
| Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation | Jan 24, 2025 | Math | Code Available | 1 |
| Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament | Jan 22, 2025 | Math | Code Available | 1 |
| Control LLM: Controlled Evolution for Intelligence Retention in LLM | Jan 19, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Jan 12, 2025 | Benchmarking, Math | Code Available | 1 |
| Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs | Jan 11, 2025 | Math, Mathematical Problem-Solving | Code Available | 1 |
| BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning | Jan 6, 2025 | In-Context Learning, Math | Code Available | 1 |
| CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis | Jan 3, 2025 | Math | Code Available | 1 |
| Toward Adaptive Reasoning in Large Language Models with Thought Rollback | Dec 27, 2024 | Math | Code Available | 1 |
| CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models | Dec 23, 2024 | Decision Making, Math | Code Available | 1 |
| Entropy-Regularized Process Reward Model | Dec 15, 2024 | GSM8K, Math | Code Available | 1 |
| HARP: A challenging human-annotated math reasoning benchmark | Dec 11, 2024 | Math | Code Available | 1 |
| U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs | Dec 4, 2024 | Diversity, Math | Code Available | 1 |
| Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability | Nov 29, 2024 | GSM8K, Math | Code Available | 1 |
| Training and Evaluating Language Models with Template-based Data Generation | Nov 27, 2024 | Data Augmentation, Math | Code Available | 1 |
| Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues | Nov 19, 2024 | Language Modeling | Code Available | 1 |
| Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations | Nov 12, 2024 | Math, Retrieval | Code Available | 1 |
| What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? | Nov 12, 2024 | GSM8K, Math | Code Available | 1 |
| UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts | Nov 11, 2024 | Code Generation, GSM8K | Code Available | 1 |
| Aioli: A Unified Optimization Framework for Language Model Data Mixing | Nov 8, 2024 | Language Modeling | Code Available | 1 |