| ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models | Feb 22, 2024 | Math, Mathematical Reasoning | Code Available | 1 |
| LEVER: Learning to Verify Language-to-Code Generation with Execution | Feb 16, 2023 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| Let's Verify Math Questions Step by Step | May 20, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation | Jan 24, 2025 | Math | Code Available | 1 |
| Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions | May 28, 2022 | Arithmetic Reasoning, Efficient Exploration | Code Available | 1 |
| Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement | Sep 17, 2024 | Active Learning, Diversity | Code Available | 1 |
| A Symbolic Character-Aware Model for Solving Geometry Problems | Aug 5, 2023 | Math, Multi-Label Classification | Code Available | 1 |
| EXAONE Deep: Reasoning Enhanced Language Models | Mar 16, 2025 | Math | Code Available | 1 |
| Learning Goal-Conditioned Representations for Language Reward Models | Jul 18, 2024 | GSM8K, Math | Code Available | 1 |
| Learning by Fixing: Solving Math Word Problems with Weak Supervision | Dec 19, 2020 | Math, Weakly-supervised Learning | Code Available | 1 |
| Learning From Mistakes Makes LLM Better Reasoner | Oct 31, 2023 | GSM8K, Math | Code Available | 1 |
| Learning Multi-Step Reasoning by Solving Arithmetic Tasks | Jun 2, 2023 | Math, Mathematical Reasoning | Code Available | 1 |
| CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning | Oct 14, 2024 | Math, Mathematical Reasoning | Code Available | 1 |
| Collective Constitutional AI: Aligning a Language Model with Public Input | Jun 12, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| A Categorical Archive of ChatGPT Failures | Feb 6, 2023 | Math | Code Available | 1 |
| Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction | Mar 19, 2022 | Math, Math Word Problem Solving | Code Available | 1 |
| Resa: Transparent Reasoning Models via SAEs | Jun 11, 2025 | Math | Code Available | 1 |
| RetICL: Sequential Retrieval of In-Context Examples with Reinforcement Learning | May 23, 2023 | In-Context Learning, Language Modelling | Code Available | 1 |
| Large Language Models Can Be Easily Distracted by Irrelevant Context | Jan 31, 2023 | Arithmetic Reasoning, Language Modeling | Code Available | 1 |
| Large Language Models Are Neurosymbolic Reasoners | Jan 17, 2024 | Common Sense Reasoning, Math | Code Available | 1 |
| Language Models Encode the Value of Numbers Linearly | Jan 8, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping | Feb 16, 2025 | Code Generation, Instruction Following | Code Available | 1 |
| Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning | Jan 27, 2023 | Few-Shot Learning, GSM8K | Code Available | 1 |
| Large (Vision) Language Models are Unsupervised In-Context Learners | Apr 3, 2025 | GSM8K, In-Context Learning | Code Available | 1 |
| Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities | Feb 17, 2025 | Code Generation, HumanEval | Code Available | 1 |
| Non-myopic Generation of Language Models for Reasoning and Planning | Oct 22, 2024 | Computational Efficiency, Language Modelling | Code Available | 1 |
| Language Models as Science Tutors | Feb 16, 2024 | GSM8K, Math | Code Available | 1 |
| LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits | Oct 2, 2024 | Instruction Following, Math | Code Available | 1 |
| Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning | May 12, 2025 | Language Modeling, Language Modelling | Code Available | 1 |
| JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models | May 23, 2024 | Knowledge Distillation, Math | Code Available | 1 |
| Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs | Nov 8, 2023 | Fairness, Math | Code Available | 1 |
| JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem Understanding | Jun 13, 2022 | Language Modeling, Language Modelling | Code Available | 1 |
| Injecting Numerical Reasoning Skills into Language Models | Apr 9, 2020 | Data Augmentation, Decoder | Code Available | 1 |
| CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | Aug 10, 2022 | Math, Mathematical Reasoning | Code Available | 1 |
| A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration | Oct 3, 2023 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | Sep 29, 2022 | Logical Reasoning, Math | Code Available | 1 |
| Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction | Jun 5, 2023 | Math | Code Available | 1 |
| FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains | Nov 16, 2023 | Math, Math Word Problem Solving | Code Available | 1 |
| LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models | Jul 5, 2025 | Benchmarking, GPU | Code Available | 1 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Jun 18, 2024 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning | May 30, 2025 | Math, Mathematical Reasoning | Code Available | 1 |
| Aioli: A Unified Optimization Framework for Language Model Data Mixing | Nov 8, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| HARP: A challenging human-annotated math reasoning benchmark | Dec 11, 2024 | Math | Code Available | 1 |
| How to Get Your LLM to Generate Challenging Problems for Evaluation | Feb 20, 2025 | Code Completion, Math | Code Available | 1 |
| CityGPT: Empowering Urban Spatial Cognition of Large Language Models | Jun 20, 2024 | Code Generation, Math | Code Available | 1 |
| On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents | Aug 2, 2024 | Code Generation, Large Language Model | Code Available | 1 |
| HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems | May 17, 2025 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics | Oct 13, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| How well do Large Language Models perform in Arithmetic tasks? | Mar 16, 2023 | Math | Code Available | 1 |
| GOLD: Geometry Problem Solver with Natural Language Description | May 1, 2024 | Math | Code Available | 1 |