STaR: Bootstrapping Reasoning With Reasoning
Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman
Code Available
- github.com/ezelikman/STaR (official, JAX) ★ 221
Abstract
Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30× larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.
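The loop described in the abstract can be sketched in a few lines. The following is an illustrative toy sketch only: `ToyModel`, `star`, and the "reverse the string" task are invented stand-ins for the paper's actual language-model prompting and fine-tuning, chosen to make the control flow concrete.

```python
class ToyModel:
    """Stand-in for a language model: answers a question correctly only if it
    was fine-tuned on it, or if the correct answer is supplied as a hint."""
    def __init__(self, known=None):
        self.known = set(known or [])

    def generate(self, prompt, question, hint=None):
        """Return (rationale, predicted_answer). Toy task: reverse the string."""
        if hint is not None:                      # rationalization pass
            return (f"rationale ending in {hint}", hint)
        if question in self.known:
            return (f"rationale for {question}", question[::-1])
        return ("no idea", None)

    def finetune(self, examples):
        """'Fine-tune' by memorizing the questions of correct examples."""
        return ToyModel(known=[q for q, _, _ in examples])


def star(model, dataset, few_shot_prompt, n_iters=3):
    """STaR outer loop: generate, rationalize failures, fine-tune, repeat."""
    base = model  # each iteration fine-tunes from the original model
    for _ in range(n_iters):
        train_set = []
        for question, answer in dataset:
            # 1. Generate a rationale and answer via few-shot prompting.
            rationale, predicted = model.generate(few_shot_prompt, question)
            if predicted != answer:
                # 2. Rationalize: retry with the correct answer as a hint.
                rationale, predicted = model.generate(
                    few_shot_prompt, question, hint=answer)
            if predicted == answer:
                # 3. Keep only rationales that yielded the correct answer.
                train_set.append((question, rationale, answer))
        # 4. Fine-tune on the collected rationale dataset.
        model = base.finetune(train_set)
    return model


dataset = [("ab", "ba"), ("cd", "dc")]
trained = star(ToyModel(), dataset, few_shot_prompt="...")
print(trained.generate("...", "ab")[1])  # -> ba
```

Note that each iteration fine-tunes from the original model rather than the previous iterate, which matches the abstract's "repeat" step and avoids compounding drift; the rationalization retry (step 2) is what lets the loop harvest training signal from questions the model initially gets wrong.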
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CommonsenseQA | STaR (on GPT-J) | Accuracy | 72.3 | — | Unverified |
| CommonsenseQA | STaR without Rationalization (on GPT-J) | Accuracy | 68.8 | — | Unverified |
| CommonsenseQA | GPT-J Direct Finetuned | Accuracy | 60 | — | Unverified |
| CommonsenseQA | Few-shot CoT LaMDA 137B | Accuracy | 55.6 | — | Unverified |
| CommonsenseQA | Few-shot CoT GPT-J | Accuracy | 36.6 | — | Unverified |
| CommonsenseQA | Few-shot Direct GPT-J | Accuracy | 20.9 | — | Unverified |