| Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning | Jan 19, 2024 | GSM8K, Math | Code Available | 1 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Jun 18, 2024 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees | Mar 11, 2025 | Chatbot, Language Modeling | Code Available | 1 |
| Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning | Jun 4, 2023 | Math | Code Available | 1 |
| Implicit Chain of Thought Reasoning via Knowledge Distillation | Nov 2, 2023 | Knowledge Distillation, Math | Code Available | 1 |
| Are NLP Models really able to Solve Simple Math Word Problems? | Mar 12, 2021 | Math, Math Word Problem Solving | Code Available | 1 |
| Case-Based or Rule-Based: How Do Transformers Do the Math? | Feb 27, 2024 | Math, Systematic Generalization | Code Available | 1 |
| Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation with Applications to Semantic Parsing and Math Word Problem | Apr 7, 2020 | Decoder, Machine Translation | Code Available | 1 |
| CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models | Dec 23, 2024 | Decision Making, Math | Code Available | 1 |
| AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation | Apr 25, 2024 | Code Generation, Math | Code Available | 1 |