| MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Oct 3, 2023 | Chatbot, Image Captioning | Code Available | 2 | 5 |
| RM-R1: Reward Modeling as Reasoning | May 5, 2025 | Math, Reinforcement Learning (RL) | Code Available | 2 | 5 |
| MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark | May 20, 2024 | College Mathematics, GSM8K | Code Available | 2 | 5 |
| MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code | Oct 10, 2024 | Math, Mathematical Reasoning | Code Available | 2 | 5 |
| Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | May 5, 2024 | GSM8K, Math | Code Available | 2 | 5 |
| Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models | May 15, 2025 | Math, reinforcement-learning | Code Available | 2 | 5 |
| Full Page Handwriting Recognition via Image to Sequence Extraction | Mar 11, 2021 | Handwriting Recognition, Handwritten Text Recognition | Code Available | 2 | 5 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning | Oct 5, 2023 | Arithmetic Reasoning, GSM8K | Code Available | 2 | 5 |
| MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | Apr 6, 2024 | Logical Reasoning, Math | Code Available | 2 | 5 |
| Agent Lumos: Unified and Modular Training for Open-Source Language Agents | Nov 9, 2023 | Math, Question Answering | Code Available | 2 | 5 |
| Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models | Apr 7, 2025 | Dialogue Evaluation, Fairness | Code Available | 2 | 5 |
| Language Models are Multilingual Chain-of-Thought Reasoners | Oct 6, 2022 | GSM8K, Math | Code Available | 2 | 5 |
| MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning | Sep 11, 2023 | Math, Mathematical Reasoning | Code Available | 2 | 5 |
| Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic | Feb 19, 2024 | Instruction Following, Math | Code Available | 2 | 5 |
| Evaluating Mathematical Reasoning Beyond Accuracy | Apr 8, 2024 | Math, Mathematical Reasoning | Code Available | 2 | 5 |
| Can AI Assistants Know What They Don't Know? | Jan 24, 2024 | Math, Open-Domain Question Answering | Code Available | 2 | 5 |
| A Comparative Study on Reasoning Patterns of OpenAI's o1 Model | Oct 17, 2024 | Math | Code Available | 2 | 5 |
| LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters | May 27, 2024 | Benchmarking, GSM8K | Code Available | 2 | 5 |
| CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Sep 4, 2024 | GSM8K, Math | Code Available | 2 | 5 |
| Essential-Web v1.0: 24T tokens of organized web data | Jun 17, 2025 | Math | Code Available | 2 | 5 |
| Dynamic Early Exit in Reasoning Models | Apr 22, 2025 | GSM8K, Math | Code Available | 2 | 5 |
| Balancing LoRA Performance and Efficiency with Simple Shard Sharing | Sep 19, 2024 | Computational Efficiency, GSM8K | Code Available | 2 | 5 |
| LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training | Nov 24, 2024 | Math, Mixture-of-Experts | Code Available | 2 | 5 |
| Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap | Feb 29, 2024 | Math | Code Available | 2 | 5 |
| Learning to Reason for Long-Form Story Generation | Mar 28, 2025 | Form, Math | Code Available | 2 | 5 |