OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

2024-02-15

Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, Igor Gitman

Code

github.com/kipok/nemo-skills
Official implementation, linked in the paper

Abstract

Recent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and the best open-source LLMs. Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and MATH, two popular math reasoning benchmarks, using the recently released and permissively licensed Mixtral model. Our best model, OpenMath-CodeLlama-70B, trained on a subset of OpenMathInstruct-1, achieves a score of 84.6% on GSM8K and 50.7% on MATH, which is competitive with the best GPT-distilled models. We release our code, models, and the OpenMathInstruct-1 dataset under a commercially permissive license.

Tasks

Arithmetic Reasoning, GSM8K, Math, Math Word Problem Solving

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
GSM8K	OpenMath-CodeLlama-70B (w/ code, SC, k=50)	Accuracy	90.8	—	Unverified
GSM8K	OpenMath-CodeLlama-7B (w/ code)	Accuracy	75.9	—	Unverified
GSM8K	OpenMath-CodeLlama-13B (w/ code)	Accuracy	78.8	—	Unverified
GSM8K	OpenMath-Mistral-7B (w/ code)	Accuracy	80.2	—	Unverified
GSM8K	OpenMath-CodeLlama-34B (w/ code)	Accuracy	80.7	—	Unverified
GSM8K	OpenMath-CodeLlama-70B (w/ code)	Accuracy	84.6	—	Unverified
GSM8K	OpenMath-Llama2-70B (w/ code)	Accuracy	84.7	—	Unverified
GSM8K	OpenMath-CodeLlama-7B (w/ code, SC, k=50)	Accuracy	84.8	—	Unverified
GSM8K	OpenMath-CodeLlama-13B (w/ code, SC, k=50)	Accuracy	86.8	—	Unverified
GSM8K	OpenMath-Mistral-7B (w/ code, SC, k=50)	Accuracy	86.9	—	Unverified
GSM8K	OpenMath-CodeLlama-34B (w/ code, SC, k=50)	Accuracy	88	—	Unverified
GSM8K	OpenMath-Llama2-70B (w/ code, SC, k=50)	Accuracy	90.1	—	Unverified
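
Several rows above are marked "SC, k=50". Assuming SC denotes self-consistency decoding (sampling k solutions per problem and taking the most frequent final answer), the scoring can be sketched as below. This is an illustrative sketch, not the authors' evaluation code: the function names `majority_vote` and `sc_accuracy` are hypothetical, and answers are compared as plain strings, whereas a real math evaluation would normalize equivalent forms first.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if none is usable.

    Hypothetical helper: None or blank entries stand for samples whose
    final answer could not be extracted.
    """
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None
    # Ties go to the answer seen first (Counter ordering, Python 3.7+).
    return Counter(valid).most_common(1)[0][0]


def sc_accuracy(
    sampled: Sequence[Sequence[Optional[str]]],
    references: Sequence[str],
) -> float:
    """Accuracy in percent when each problem is scored by its majority answer."""
    if not references or len(sampled) != len(references):
        raise ValueError("need one non-empty list of samples per reference")
    correct = sum(
        majority_vote(answers) == ref.strip()
        for answers, ref in zip(sampled, references)
    )
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    # Toy data with k=3 samples per problem (the table uses k=50).
    samples = [["18", "18", "20"], ["5", "7", "7"]]
    refs = ["18", "5"]
    print(sc_accuracy(samples, refs))
```

In the toy data the first problem's majority answer matches its reference and the second's does not, so half of the problems count as correct. With a single sample per problem the same function reduces to ordinary accuracy, which is one way to read the rows without the SC label.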
