| MathViz-E: A Case-study in Domain-Specialized Tool-Using Agents | Jul 24, 2024 | Math | CodeCode Available | 1 | 5 |
| MathPrompter: Mathematical Reasoning using Large Language Models | Mar 4, 2023 | Arithmetic ReasoningMath | CodeCode Available | 1 | 5 |
| Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models | Jun 4, 2025 | Math | CodeCode Available | 1 | 5 |
| Building Dataset for Grounding of Formulae — Annotating Coreference Relations Among Math Identifiers | Jun 1, 2022 | Math | CodeCode Available | 1 | 5 |
| Broken Neural Scaling Laws | Oct 26, 2022 | Adversarial RobustnessContinual Learning | CodeCode Available | 1 | 5 |
| Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks | May 30, 2025 | Autonomous DrivingMath | CodeCode Available | 1 | 5 |
| Brilla AI: AI Contestant for the National Science and Maths Quiz | Mar 4, 2024 | MathQuestion Answering | CodeCode Available | 1 | 5 |
| Forgotten Polygons: Multimodal Large Language Models are Shape-Blind | Feb 21, 2025 | MathMathematical Problem-Solving | CodeCode Available | 1 | 5 |
| FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving | Feb 27, 2025 | GSM8KMath | CodeCode Available | 1 | 5 |
| Ape210K: A Large-Scale and Template-Rich Dataset of Math Word Problems | Sep 24, 2020 | DiversityMath | CodeCode Available | 1 | 5 |
| FormulaNet: A Benchmark Dataset for Mathematical Formula Detection | Aug 29, 2022 | Math | CodeCode Available | 1 | 5 |
| Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization | Aug 14, 2024 | InformativenessInstruction Following | CodeCode Available | 1 | 5 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Oct 1, 2023 | BenchmarkingMath | CodeCode Available | 1 | 5 |
| Fine-Tuning Large Language Models on Quantum Optimization Problems for Circuit Generation | Apr 15, 2025 | MathQuantum Machine Learning | CodeCode Available | 1 | 5 |
| MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | May 29, 2024 | BenchmarkingDialogue Understanding | CodeCode Available | 1 | 5 |
| Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations | Oct 31, 2023 | GSM8KMath | CodeCode Available | 1 | 5 |
| Expression Syntax Information Bottleneck for Math Word Problems | Oct 24, 2023 | Math | CodeCode Available | 1 | 5 |
| BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning | Jan 6, 2025 | In-Context LearningMath | CodeCode Available | 1 | 5 |
| Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching | Jul 24, 2024 | Math | CodeCode Available | 1 | 5 |
| Explaining Datasets in Words: Statistical Models with Natural Language Parameters | Sep 13, 2024 | ClusteringLanguage Modeling | CodeCode Available | 1 | 5 |
| MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education | Jun 2, 2021 | Knowledge TracingLanguage Modeling | CodeCode Available | 1 | 5 |
| MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations | Feb 24, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning | Sep 18, 2024 | Math | CodeCode Available | 1 | 5 |
| Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula | Aug 8, 2024 | GSM8KLanguage Modeling | CodeCode Available | 1 | 5 |
| BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Apr 2, 2025 | 3D ReconstructionBenchmarking | CodeCode Available | 1 | 5 |