| Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Feb 17, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Dyve: Thinking Fast and Slow for Dynamic Process Verification | Feb 16, 2025 | Math | Code Available | 1 |
| Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls | Feb 16, 2025 | Computational Efficiency, GSM8K | Code Available | 0 |
| Graders should cheat: privileged information enables expert-level automated evaluations | Feb 16, 2025 | Math | Unverified | 0 |
| Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping | Feb 16, 2025 | Code Generation, Instruction Following | Code Available | 1 |
| 1bit-Merging: Dynamic Quantized Merging for Large Language Models | Feb 15, 2025 | Code Generation, Math | Unverified | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | Feb 13, 2025 | Benchmarking, Math | Unverified | 0 |
| CRANE: Reasoning with constrained LLM generation | Feb 13, 2025 | Code Generation, Math | Unverified | 0 |
| Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving | Feb 12, 2025 | Math, multimodal interaction | Unverified | 0 |
| Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges | Feb 12, 2025 | GSM8K, Math | Code Available | 0 |
| LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! | Feb 11, 2025 | Large Language Model, Math | Code Available | 7 |
| Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving | Feb 11, 2025 | Automated Theorem Proving, Large Language Model | Code Available | 3 |
| O1 Embedder: Let Retrievers Think Before Action | Feb 11, 2025 | Contrastive Learning, Math | Unverified | 0 |
| CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction | Feb 11, 2025 | Code Generation, Math | Code Available | 4 |
| Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning | Feb 11, 2025 | Code Generation, Math | Code Available | 0 |
| Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | Feb 10, 2025 | Math | Code Available | 3 |
| MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | Feb 10, 2025 | Benchmarking, In-Context Learning | Unverified | 0 |
| On the Emergence of Thinking in LLMs I: Searching for the Right Intuition | Feb 10, 2025 | Math | Code Available | 2 |
| ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates | Feb 10, 2025 | Hierarchical Reinforcement Learning, Language Modeling | Code Available | 4 |
| Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning | Feb 10, 2025 | Math, Mathematical Reasoning | Code Available | 2 |
| Evolving LLMs' Self-Refinement Capability via Iterative Preference Optimization | Feb 8, 2025 | GSM8K, Math | Unverified | 0 |
| GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? | Feb 7, 2025 | 8k, Information Retrieval | Code Available | 2 |
| BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation | Feb 6, 2025 | In-Context Learning, Knowledge Distillation | Unverified | 0 |
| Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2 | Feb 5, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | Feb 5, 2025 | GSM8K, HumanEval | Unverified | 0 |
| Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting | Feb 5, 2025 | GSM8K, Math | Code Available | 0 |
| Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference | Feb 5, 2025 | Computational Efficiency, Language Modeling | Unverified | 0 |
| LIMO: Less is More for Reasoning | Feb 5, 2025 | Math, Mathematical Reasoning | Code Available | 5 |
| Do Large Language Model Benchmarks Test Reliability? | Feb 5, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model | Feb 4, 2025 | Instruction Following, Language Modeling | Unverified | 0 |
| Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs | Feb 4, 2025 | Math, Mathematical Reasoning | Unverified | 0 |
| Process Reinforcement through Implicit Rewards | Feb 3, 2025 | Math, Reinforcement Learning (RL) | Code Available | 5 |
| A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Feb 3, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| Blink of an eye: a simple theory for feature localization in generative models | Feb 2, 2025 | Math | Unverified | 0 |
| Learning Autonomous Code Integration for Math Language Models | Feb 2, 2025 | Math | Unverified | 0 |
| Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? | Feb 2, 2025 | Math, MMLU | Unverified | 0 |
| UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models | Feb 1, 2025 | Math | Code Available | 2 |
| Fairshare Data Pricing via Data Valuation for Large Language Models | Jan 31, 2025 | Data Valuation, Math | Unverified | 0 |
| BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning | Jan 31, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| s1: Simple test-time scaling | Jan 31, 2025 | Language Modeling, Language Modelling | Code Available | 9 |
| Pheromone-based Learning of Optimal Reasoning Paths | Jan 31, 2025 | ARC, GSM8K | Unverified | 0 |
| Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping | Jan 31, 2025 | Denoising, Image Denoising | Code Available | 0 |
| PixelWorld: Towards Perceiving Everything as Pixels | Jan 31, 2025 | Math | Unverified | 0 |
| Examining the Robustness of Large Language Models across Language Complexity | Jan 30, 2025 | Math | Unverified | 0 |
| Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis | Jan 30, 2025 | Automated Theorem Proving, Math | Code Available | 1 |
| Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH | Jan 30, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate | Jan 29, 2025 | Instruction Following, Math | Code Available | 2 |
| Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving | Jan 28, 2025 | Math, Mathematical Problem-Solving | Unverified | 0 |
| Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework | Jan 26, 2025 | Math, Mathematical Reasoning | Unverified | 0 |
| Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning | Jan 25, 2025 | Math | Unverified | 0 |