| VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context | May 8, 2024 | MathMathematical Reasoning | CodeCode Available | 1 |
| MAmmoTH2: Scaling Instructions from the Web | May 6, 2024 | ChatbotGSM8K | —Unverified | 0 |
| Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning | May 5, 2024 | GSM8KMath | CodeCode Available | 2 |
| Assessing and Verifying Task Utility in LLM-Powered Applications | May 3, 2024 | Math | —Unverified | 0 |
| Math Multiple Choice Question Generation via Human-Large Language Model Collaboration | May 1, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| GOLD: Geometry Problem Solver with Natural Language Description | May 1, 2024 | Math | CodeCode Available | 1 |
| A Careful Examination of Large Language Model Performance on Grade School Arithmetic | May 1, 2024 | GSM8KLanguage Modeling | —Unverified | 0 |
| Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning | May 1, 2024 | ARCGSM8K | CodeCode Available | 3 |
| Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models | May 1, 2024 | Math | —Unverified | 0 |
| Iterative Reasoning Preference Optimization | Apr 30, 2024 | ARCGSM8K | —Unverified | 0 |
| PECC: Problem Extraction and Coding Challenges | Apr 29, 2024 | Code GenerationMath | CodeCode Available | 1 |
| Small Language Models Need Strong Verifiers to Self-Correct Reasoning | Apr 26, 2024 | Math | CodeCode Available | 0 |
| LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Apr 25, 2024 | GSM8KHellaSwag | CodeCode Available | 3 |
| AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation | Apr 25, 2024 | Code GenerationMath | CodeCode Available | 1 |
| Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems | Apr 23, 2024 | Arithmetic ReasoningGSM8K | CodeCode Available | 1 |
| Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training | Apr 22, 2024 | MathMathematical Reasoning | —Unverified | 0 |
| MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit | Apr 22, 2024 | Math | CodeCode Available | 5 |
| PARAMANU-GANITA: Language Model with Mathematical Capabilities | Apr 22, 2024 | Domain AdaptationGSM8K | —Unverified | 0 |
| Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone | Apr 22, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank | Apr 19, 2024 | Distractor GenerationMath | —Unverified | 0 |
| Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing | Apr 18, 2024 | Arithmetic ReasoningGSM8K | CodeCode Available | 1 |
| On the Empirical Complexity of Reasoning and Planning in LLMs | Apr 17, 2024 | Math | —Unverified | 0 |
| Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards | Apr 16, 2024 | GSM8KMath | CodeCode Available | 2 |
| Mental Stress Detection: Development and Evaluation of a Wearable In-Ear Plethysmography | Apr 12, 2024 | MathMental Stress Detection | —Unverified | 0 |
| Rho-1: Not All Tokens Are What You Need | Apr 11, 2024 | AllContinual Pretraining | CodeCode Available | 3 |
| Personality-aware Student Simulation for Conversational Intelligent Tutoring Systems | Apr 10, 2024 | Math | —Unverified | 0 |
| MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education | Apr 10, 2024 | Math | —Unverified | 0 |
| Evaluating Mathematical Reasoning Beyond Accuracy | Apr 8, 2024 | MathMathematical Reasoning | CodeCode Available | 2 |
| MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Apr 7, 2024 | Image ComprehensionMath | CodeCode Available | 0 |
| FRACTAL: Fine-Grained Scoring from Aggregate Text Labels | Apr 7, 2024 | MathMultiple Instance Learning | —Unverified | 0 |
| MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems | Apr 6, 2024 | Logical ReasoningMath | CodeCode Available | 2 |
| Data Augmentation with In-Context Learning and Comparative Evaluation in Math Word Problem Solving | Apr 5, 2024 | Data AugmentationIn-Context Learning | —Unverified | 0 |
| BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models | Apr 3, 2024 | GPUMath | CodeCode Available | 3 |
| ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline | Apr 3, 2024 | MathMathematical Problem-Solving | CodeCode Available | 2 |
| Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Apr 3, 2024 | BenchmarkingGeneral Knowledge | CodeCode Available | 1 |
| LM^2: A Simple Society of Language Models Solves Complex Reasoning | Apr 2, 2024 | MathMedQA | CodeCode Available | 0 |
| Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models | Apr 2, 2024 | Distractor GenerationIn-Context Learning | CodeCode Available | 0 |
| HyperCLOVA X Technical Report | Apr 2, 2024 | Instruction FollowingMachine Translation | —Unverified | 0 |
| Exploring the Mystery of Influential Data for Mathematical Reasoning | Apr 1, 2024 | MathMathematical Reasoning | —Unverified | 0 |
| Stable Code Technical Report | Apr 1, 2024 | Code CompletionLanguage Modelling | —Unverified | 0 |
| IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations | Apr 1, 2024 | BenchmarkingMath | —Unverified | 0 |
| Self-Demos: Eliciting Out-of-Demonstration Generalizability in Large Language Models | Apr 1, 2024 | In-Context LearningMath | CodeCode Available | 0 |
| What is in Your Safe Data? Identifying Benign Data that Breaks Safety | Apr 1, 2024 | Math | CodeCode Available | 1 |
| Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange | Mar 30, 2024 | MathMathematical Problem-Solving | CodeCode Available | 0 |
| ML2SC: Deploying Machine Learning Models as Smart Contracts on the Blockchain | Mar 28, 2024 | Math | —Unverified | 0 |
| Large Language Models Are Struggle to Cope with Unreasonability in Math Problems | Mar 28, 2024 | Math | —Unverified | 0 |
| Scaling up ridge regression for brain encoding in a massive individual fMRI dataset | Mar 28, 2024 | CPUMath | CodeCode Available | 0 |
| Few-Shot Recalibration of Language Models | Mar 27, 2024 | MathMMLU | —Unverified | 0 |
| The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian | Mar 27, 2024 | Language ModellingMath | —Unverified | 0 |
| Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization | Mar 26, 2024 | Automated Theorem ProvingGSM8K | CodeCode Available | 1 |