Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
Official code (in paper): github.com/google-research/flan (TensorFlow)
Abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
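To make the method concrete, the core data-construction step can be sketched as follows. This is an illustrative sketch only: the template wordings and the `verbalize` helper are hypothetical, not FLAN's actual templates, but they show the idea of rendering one task example into several natural-language instruction prompts whose (prompt, target) pairs are then mixed across tasks for finetuning.

```python
# Hypothetical instruction templates for a natural language inference (NLI)
# task; FLAN's real templates differ in wording.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Options: yes, maybe, no.",
    "{premise}\nBased on the paragraph above, can we conclude that "
    '"{hypothesis}"? Options: yes, maybe, no.',
]

# Map integer labels to natural-language answer strings.
LABELS = {0: "yes", 1: "maybe", 2: "no"}  # entailment / neutral / contradiction


def verbalize(example: dict) -> list:
    """Render one NLI example into (prompt, target) pairs, one per template."""
    target = LABELS[example["label"]]
    return [
        (
            template.format(
                premise=example["premise"], hypothesis=example["hypothesis"]
            ),
            target,
        )
        for template in NLI_TEMPLATES
    ]


pairs = verbalize(
    {
        "premise": "A dog is running in the park.",
        "hypothesis": "An animal is outdoors.",
        "label": 0,
    }
)
for prompt, target in pairs:
    print(prompt, "->", target)
```

Each source dataset contributes examples rendered through multiple such templates, and the model is finetuned on the union of these prompts with standard next-token prediction on the target string.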