
Need a Small Specialized Language Model? Plan Early!

2024-02-02

David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

Unverified — Be the first to reproduce this paper.


Abstract

Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get good specialized small language models using a large, generic pretraining set and a limited amount of specialized data. We consider two scenarios, depending on whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a single pretrained model for each task. In the first scenario, we propose an effective solution based on importance sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.
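
As a rough illustration of the two scenarios described in the abstract, the sketch below shows one plausible form of importance-sampling resampling (each generic pretraining example is weighted by a log-likelihood ratio between a small domain model and a generic model, then a new training set is drawn from those weights) and, in spirit, how a large layer's parameters could be linearly projected into a smaller layer. The function names, the likelihood-ratio weighting, and the NumPy implementation are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np


def importance_resample(generic_texts, logp_specialized, logp_generic,
                        n_samples, rng=None):
    """Resample a generic pretraining corpus so it imitates a specialized
    target domain (illustrative sketch, not the paper's implementation).

    logp_specialized / logp_generic: callables returning the log-probability
    of a text under a small domain model and a generic model; their ratio
    serves as an importance weight.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Importance weight of each generic example: how domain-like it looks.
    log_w = np.array([logp_specialized(t) - logp_generic(t)
                      for t in generic_texts])
    # Normalize the weights into a sampling distribution (stable softmax).
    w = np.exp(log_w - log_w.max())
    p = w / w.sum()
    # Draw a resampled pretraining set biased toward the target domain.
    idx = rng.choice(len(generic_texts), size=n_samples, replace=True, p=p)
    return [generic_texts[i] for i in idx]


def project_linear_layer(W_large, P_out, P_in):
    """Illustrative linear projection of a large layer's weight matrix into
    a smaller one: W_small = P_out @ W_large @ P_in, with P_out of shape
    (d_small, d_large) and P_in of shape (d_large, d_small). The exact
    projection scheme used by the paper may differ."""
    return P_out @ W_large @ P_in


# Example usage with toy, hypothetical log-probability functions:
# spec = lambda t: -0.1 * len(t)   # stand-ins for real language models
# gen = lambda t: -0.2 * len(t)
# subset = importance_resample(["doc a", "doc b", "longer document c"],
#                              spec, gen, n_samples=2)
```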

Tasks

Benchmark Results

Dataset  | Model                                  | Metric          | Claimed | Verified | Status
The Pile | Larger Transformer 771M (fine-tuned)   | Test perplexity | 10      |          | Unverified
The Pile | Smaller Transformer 126M (fine-tuned)  | Test perplexity | 12      |          | Unverified
The Pile | Larger Transformer 771M (pre-trained)  | Test perplexity | 28.1    |          | Unverified
The Pile | Smaller Transformer 126M (pre-trained) | Test perplexity | 33      |          | Unverified

Reproductions