STaR: Bootstrapping Reasoning With Reasoning
Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman
Code Available
- github.com/ezelikman/STaR (official, JAX) ★ 221
Abstract
Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question-answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30× larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning.
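The loop described in the abstract can be sketched in a few lines. The following is an illustrative toy sketch only: `ToyModel`, `star`, and the "reverse the string" task are invented stand-ins for the paper's actual language-model prompting and fine-tuning, chosen to make the control flow concrete.

```python
class ToyModel:
    """Stand-in for a language model: answers a question correctly only if it
    was fine-tuned on it, or if the correct answer is supplied as a hint."""
    def __init__(self, known=None):
        self.known = set(known or [])

    def generate(self, prompt, question, hint=None):
        """Return (rationale, predicted_answer). Toy task: reverse the string."""
        if hint is not None:                      # rationalization pass
            return (f"rationale ending in {hint}", hint)
        if question in self.known:
            return (f"rationale for {question}", question[::-1])
        return ("no idea", None)

    def finetune(self, examples):
        """'Fine-tune' by memorizing the questions of correct examples."""
        return ToyModel(known=[q for q, _, _ in examples])


def star(model, dataset, few_shot_prompt, n_iters=3):
    """STaR outer loop: generate, rationalize failures, fine-tune, repeat."""
    base = model  # each iteration fine-tunes from the original model
    for _ in range(n_iters):
        train_set = []
        for question, answer in dataset:
            # 1. Generate a rationale and answer via few-shot prompting.
            rationale, predicted = model.generate(few_shot_prompt, question)
            if predicted != answer:
                # 2. Rationalize: retry with the correct answer as a hint.
                rationale, predicted = model.generate(
                    few_shot_prompt, question, hint=answer)
            if predicted == answer:
                # 3. Keep only rationales that yielded the correct answer.
                train_set.append((question, rationale, answer))
        # 4. Fine-tune on the collected rationale dataset.
        model = base.finetune(train_set)
    return model


dataset = [("ab", "ba"), ("cd", "dc")]
trained = star(ToyModel(), dataset, few_shot_prompt="...")
print(trained.generate("...", "ab")[1])  # -> ba
```

Note that each iteration fine-tunes from the original model rather than the previous iterate, which matches the abstract's "repeat" step and avoids compounding drift; the rationalization retry (step 2) is what lets the loop harvest training signal from questions the model initially gets wrong.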
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CommonsenseQA | STaR (on GPT-J) | Accuracy | 72.3 | — | Unverified |
| CommonsenseQA | STaR without Rationalization (on GPT-J) | Accuracy | 68.8 | — | Unverified |
| CommonsenseQA | GPT-J Direct Finetuned | Accuracy | 60 | — | Unverified |
| CommonsenseQA | Few-shot CoT LaMDA 137B | Accuracy | 55.6 | — | Unverified |
| CommonsenseQA | Few-shot CoT GPT-J | Accuracy | 36.6 | — | Unverified |
| CommonsenseQA | Few-shot Direct GPT-J | Accuracy | 20.9 | — | Unverified |