| Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems | Sep 30, 2024 | GSM8K, Math | Code Available | 0 | 5 |
| SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving | Oct 19, 2023 | GSM8K, Math | Code Available | 0 | 5 |
| ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | Jun 16, 2024 | Continual Learning, GSM8K | Code Available | 0 | 5 |
| SMART: Self-learning Meta-strategy Agent for Reasoning Tasks | Oct 21, 2024 | GSM8K, Self-Learning | Code Available | 0 | 5 |
| AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations | Nov 22, 2023 | Common Sense Reasoning, GSM8K | Code Available | 0 | 5 |
| Text-to-LoRA: Instant Transformer Adaption | Jun 6, 2025 | ARC, GSM8K | Code Available | 0 | 5 |
| metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models | Jul 4, 2024 | ARC, GSM8K | Code Available | 0 | 5 |
| The Price of Format: Diversity Collapse in LLMs | May 25, 2025 | Diversity, GSM8K | Code Available | 0 | 5 |
| TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation | Feb 19, 2025 | Dataset Generation, GSM8K | Code Available | 0 | 5 |
| TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students | May 2, 2025 | GSM8K, In-Context Learning | Code Available | 0 | 5 |
| Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting | Feb 5, 2025 | GSM8K, Math | Code Available | 0 | 5 |
| VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation | Jun 25, 2024 | ARC, Benchmarking | Code Available | 0 | 5 |
| Iterative Reasoning Preference Optimization | Apr 30, 2024 | ARC, GSM8K | Unverified | 0 | 0 |
| Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning | Apr 17, 2024 | GSM8K, Navigate | Unverified | 0 | 0 |
| MALT: Improving Reasoning with Multi-Agent LLM Training | Dec 2, 2024 | Common Sense Reasoning, GSM8K | Unverified | 0 | 0 |
| MAmmoTH2: Scaling Instructions from the Web | May 6, 2024 | Chatbot, GSM8K | Unverified | 0 | 0 |
| Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | Jul 11, 2024 | GSM8K, Math | Unverified | 0 | 0 |
| Is your LLM trapped in a Mental Set? Investigative study on how mental sets affect the reasoning capabilities of LLMs | Jan 21, 2025 | GSM8K, In-Context Learning | Unverified | 0 | 0 |
| Interpretable Math Word Problem Solution Generation Via Step-by-step Planning | Jun 1, 2023 | GSM8K, Language Modeling | Unverified | 0 | 0 |
| MathAttack: Attacking Large Language Models Towards Math Solving Ability | Sep 4, 2023 | Adversarial Attack, GSM8K | Unverified | 0 | 0 |
| Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models | Feb 18, 2025 | Data Augmentation, GSM8K | Unverified | 0 | 0 |
| Instance-adaptive Zero-shot Chain-of-Thought Prompting | Sep 30, 2024 | GSM8K, Math | Unverified | 0 | 0 |
| MathDivide: Improved mathematical reasoning by large language models | May 12, 2024 | GSM8K, Logical Reasoning | Unverified | 0 | 0 |
| TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling | Oct 18, 2024 | Computational Efficiency, GSM8K | Unverified | 0 | 0 |
| MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task | Feb 17, 2025 | Code Completion, GSM8K | Unverified | 0 | 0 |