
Making Small Language Models Better Few-Shot Learners

2021-11-16 · ACL ARR November 2021

Anonymous


Abstract

Large-scale language models coupled with prompts have shown remarkable performance on few-shot learning. However, through systematic experiments, we find that the few-shot performance of small language models is poor, and that prompts bring smaller improvements on them than on larger models. In this paper, we propose SMASH, an approach that improves SMAll language models' few-SHot ability by training on intermediate tasks before prompt-based fine-tuning on downstream tasks. We design intermediate tasks for both sentence-pair and single-sentence classification tasks: training examples are created from sentences sampled from a large-scale unsupervised corpus using prompt templates similar to those of the downstream tasks, and knowledge distillation from the outputs of larger pre-trained models serves as the training objective. We conduct extensive experiments and show that SMASH can make a 6-layer DistilRoBERTa-base achieve performance on few-shot datasets comparable to that of a 12-layer RoBERTa-base at a low cost.
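To make the intermediate-task recipe concrete, below is a minimal sketch (not the authors' released code) of the single-sentence variant: unlabeled sentences are wrapped in a downstream-style prompt template with a masked label word, and the small student is trained to match the larger teacher's distribution over label-word logits at the mask position. The model names, the template, the label words, and the temperature are illustrative assumptions, not details taken from the paper.

```python
# Sketch of SMASH-style intermediate training (assumptions flagged inline).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

teacher_name = "roberta-base"        # larger teacher model (assumption)
student_name = "distilroberta-base"  # 6-layer student (assumption)

tokenizer = AutoTokenizer.from_pretrained(student_name)
teacher = AutoModelForMaskedLM.from_pretrained(teacher_name).eval()
student = AutoModelForMaskedLM.from_pretrained(student_name)

# Hypothetical single-sentence template and label words, mirroring a
# prompt-based sentiment-classification setup.
template = "{sentence} It was {mask}."
label_words = ["great", "terrible"]
label_ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + w))[0]
             for w in label_words]

def mask_logits(model, sentences):
    """Return logits over the label words at each example's mask position."""
    texts = [template.format(sentence=s, mask=tokenizer.mask_token)
             for s in sentences]
    batch = tokenizer(texts, return_tensors="pt",
                      padding=True, truncation=True)
    logits = model(**batch).logits                  # (batch, seq, vocab)
    mask_pos = (batch["input_ids"] == tokenizer.mask_token_id)
    return logits[mask_pos][:, label_ids]           # (batch, num_labels)

def distill_step(sentences, temperature=2.0):
    """One KD step: KL divergence between teacher and student label-word
    distributions, scaled by temperature as in standard distillation."""
    with torch.no_grad():
        t = mask_logits(teacher, sentences) / temperature
    s = mask_logits(student, sentences) / temperature
    return F.kl_div(F.log_softmax(s, dim=-1),
                    F.softmax(t, dim=-1),
                    reduction="batchmean") * temperature ** 2

# Usage: sentences would be sampled from a large unsupervised corpus.
loss = distill_step(["The movie was a pleasant surprise."])
loss.backward()
```

A sentence-pair variant would follow the same pattern with a two-slot template (e.g., joining two sampled sentences around a masked connective); after this intermediate stage, the student is prompt-based fine-tuned on the few-shot downstream data as usual.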
