| Nerva: a Truly Sparse Implementation of Neural Networks | Jul 24, 2024 | Math | Code Available | 1 |
| MathViz-E: A Case-study in Domain-Specialized Tool-Using Agents | Jul 24, 2024 | Math | Code Available | 1 |
| Toward Adaptive Reasoning in Large Language Models with Thought Rollback | Jul 21, 2024 | Arithmetic Reasoning, Math | Code Available | 1 |
| Learning Goal-Conditioned Representations for Language Reward Models | Jul 18, 2024 | GSM8K, Math | Code Available | 1 |
| TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish | Jul 17, 2024 | Math, Multiple-choice | Code Available | 1 |
| OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Jul 13, 2024 | Benchmarking, Math | Code Available | 1 |
| AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models | Jul 11, 2024 | Language Modelling, Math | Code Available | 1 |
| DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning | Jul 4, 2024 | Avg, GSM8K | Code Available | 1 |
| Eliminating Position Bias of Language Models: A Mechanistic Approach | Jul 1, 2024 | Math, object-detection | Code Available | 1 |
| Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning | Jun 30, 2024 | GSM8K, Math | Code Available | 1 |
| Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs | Jun 24, 2024 | Instruction Following, Math | Code Available | 1 |
| LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback | Jun 20, 2024 | Binary Classification, GSM8K | Code Available | 1 |
| RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold | Jun 20, 2024 | Math, Reinforcement Learning (RL) | Code Available | 1 |
| CityGPT: Empowering Urban Spatial Cognition of Large Language Models | Jun 20, 2024 | Code Generation, Math | Code Available | 1 |
| Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles | Jun 18, 2024 | Arithmetic Reasoning, Code Generation | Code Available | 1 |
| DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling | Jun 17, 2024 | GSM8K, Math | Code Available | 1 |
| Collective Constitutional AI: Aligning a Language Model with Public Input | Jun 12, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning | Jun 6, 2024 | Math | Code Available | 1 |
| TAIA: Large Language Models are Out-of-Distribution Data Learners | May 30, 2024 | Math | Code Available | 1 |
| MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | May 29, 2024 | Benchmarking, Dialogue Understanding | Code Available | 1 |
| JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models | May 23, 2024 | Knowledge Distillation, Math | Code Available | 1 |
| Multiple-Choice Questions are Efficient and Robust LLM Evaluators | May 20, 2024 | GSM8K, HumanEval | Code Available | 1 |
| TANQ: An open domain dataset of table answered questions | May 13, 2024 | Math, Open-Domain Question Answering | Code Available | 1 |
| VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context | May 8, 2024 | Math, Mathematical Reasoning | Code Available | 1 |
| GOLD: Geometry Problem Solver with Natural Language Description | May 1, 2024 | Math | Code Available | 1 |