| BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search | Sep 26, 2024 | Math, Mathematical Problem-Solving | Code Available | 1 |
| MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula | Jul 1, 2024 | Mathematical Problem-Solving | Code Available | 1 |
| MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | May 29, 2024 | Benchmarking, Dialogue Understanding | Code Available | 1 |
| Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks | Apr 23, 2024 | Mathematical Problem-Solving, Question Answering | Code Available | 1 |
| Evaluating Language Models for Mathematics through Interactions | Jun 2, 2023 | Language Modelling, Mathematical Problem-Solving | Code Available | 1 |
| A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets | May 29, 2023 | Bias Detection, Code Generation | Code Available | 1 |
| Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in Transformers | Apr 1, 2023 | Inductive Bias, Mathematical Problem-Solving | Code Available | 1 |
| LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning | Jun 16, 2025 | Code Generation, Mathematical Problem-Solving | Code Available | 0 |
| TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving | Jun 12, 2025 | Logical Reasoning, Mathematical Problem-Solving | Unverified | 0 |
| Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation | Jun 8, 2025 | Code Generation, Mathematical Problem-Solving | Code Available | 0 |