
Need a Small Specialized Language Model? Plan Early!

2024-02-02

David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

Unverified — Be the first to reproduce this paper.


Abstract

Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get good specialized small language models using a large, generic pretraining set and a limited amount of specialized data. We consider two scenarios, depending on whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a single pretrained model for each task. In the first scenario, we propose an effective solution based on importance sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.
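
As a rough illustration of the two scenarios described in the abstract, the sketch below shows one plausible form of importance-sampling resampling (each generic pretraining example is weighted by a log-likelihood ratio between a small domain model and a generic model, then a new training set is drawn from those weights) and, in spirit, how a large layer's parameters could be linearly projected into a smaller layer. The function names, the likelihood-ratio weighting, and the NumPy implementation are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np


def importance_resample(generic_texts, logp_specialized, logp_generic,
                        n_samples, rng=None):
    """Resample a generic pretraining corpus so it imitates a specialized
    target domain (illustrative sketch, not the paper's implementation).

    logp_specialized / logp_generic: callables returning the log-probability
    of a text under a small domain model and a generic model; their ratio
    serves as an importance weight.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Importance weight of each generic example: how domain-like it looks.
    log_w = np.array([logp_specialized(t) - logp_generic(t)
                      for t in generic_texts])
    # Normalize the weights into a sampling distribution (stable softmax).
    w = np.exp(log_w - log_w.max())
    p = w / w.sum()
    # Draw a resampled pretraining set biased toward the target domain.
    idx = rng.choice(len(generic_texts), size=n_samples, replace=True, p=p)
    return [generic_texts[i] for i in idx]


def project_linear_layer(W_large, P_out, P_in):
    """Illustrative linear projection of a large layer's weight matrix into
    a smaller one: W_small = P_out @ W_large @ P_in, with P_out of shape
    (d_small, d_large) and P_in of shape (d_large, d_small). The exact
    projection scheme used by the paper may differ."""
    return P_out @ W_large @ P_in


# Example usage with toy, hypothetical log-probability functions:
# spec = lambda t: -0.1 * len(t)   # stand-ins for real language models
# gen = lambda t: -0.2 * len(t)
# subset = importance_resample(["doc a", "doc b", "longer document c"],
#                              spec, gen, n_samples=2)
```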

Tasks

Benchmark Results

Dataset  | Model                                  | Metric          | Claimed | Verified | Status
The Pile | Larger Transformer 771M (fine-tuned)   | Test perplexity | 10      |          | Unverified
The Pile | Smaller Transformer 126M (fine-tuned)  | Test perplexity | 12      |          | Unverified
The Pile | Larger Transformer 771M (pre-trained)  | Test perplexity | 28.1    |          | Unverified
The Pile | Smaller Transformer 126M (pre-trained) | Test perplexity | 33      |          | Unverified

Reproductions