Scaling Instruction-Finetuned Language Models

2022-10-20

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei

Code

Repository	Framework	Stars
github.com/google-research/flan	tf	1,559
github.com/declare-lab/flan-alpaca	pytorch	357
github.com/formulamonks/llm-benchmarker-suite	pytorch	48
github.com/theoremone/llm-benchmarker-suite	pytorch	48
github.com/zchuz/timebench	none	34
github.com/coastalcph/zeroshot_lexglue	none	29
github.com/joelniklaus/lawinstruct	none	26
github.com/kapllan/zeroshot_lexglue	none	0
github.com/yli-z/ml4h_are_clinical_t5_models_better_for_clinical_text	none	0

Abstract

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
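
The prompting setups named in the abstract (zero-shot, few-shot, CoT) can be illustrated with a small sketch of how a single question might be rendered as text. This is not the paper's template set: the `Example` class, the `Q:`/`A:` layout, the "So the answer is" phrasing, and the step-by-step trigger line are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """Hypothetical container for one task instance."""
    question: str
    answer: str
    rationale: Optional[str] = None  # chain-of-thought text, if available


def format_example(ex: Example, use_cot: bool, include_answer: bool) -> str:
    """Render one example with an illustrative (assumed) Q/A template."""
    lines = [f"Q: {ex.question}"]
    if not include_answer:
        # Final query: leave the answer open for the model to complete.
        lines.append("A: Let's think step by step." if use_cot else "A:")
        return "\n".join(lines)
    if use_cot and ex.rationale:
        # CoT exemplar: reasoning first, then the final answer.
        lines.append(f"A: {ex.rationale} So the answer is {ex.answer}.")
    else:
        lines.append(f"A: {ex.answer}")
    return "\n".join(lines)


def build_prompt(query: Example, exemplars: List[Example], use_cot: bool = False) -> str:
    """Zero-shot when `exemplars` is empty, few-shot otherwise."""
    blocks = [format_example(ex, use_cot, include_answer=True) for ex in exemplars]
    blocks.append(format_example(query, use_cot, include_answer=False))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo = Example("What is 2 + 3?", "5", rationale="2 plus 3 equals 5.")
    query = Example("What is 4 + 1?", "5")
    print(build_prompt(query, []))                    # zero-shot
    print(build_prompt(query, [demo], use_cot=True))  # few-shot with CoT
```

Passing an empty exemplar list gives a zero-shot prompt, while passing exemplars with rationales and `use_cot=True` gives a few-shot chain-of-thought prompt. The same function covers all three setups, which keeps the comparison between them to a change of arguments.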

Tasks

Coreference Resolution, Cross-Lingual Question Answering, MMLU, Multi-task Language Understanding, Paraphrase Identification, Question Answering

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Winograd Schema Challenge	Flan-T5 XXL (zero-shot)	Accuracy	89.82	—	Unverified
