Small Language Model as Data Prospector for Large Language Model

2024-12-13

Shiwen Ni, Haihong Wu, Di Yang, Qiang Qu, Hamid Alinejad-Rokny, Min Yang

Abstract

The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Li et al. (2023) previously proposed NUGGETS, which selects high-quality data from a large dataset by identifying the individual instruction examples that, once learnt as one-shot instances, significantly improve performance across different tasks. In this work, we propose SuperNUGGETS, an improved variant of NUGGETS optimised for efficiency and performance. SuperNUGGETS uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances, and it refines the predefined set of tests. The experimental results show that SuperNUGGETS performs only 1-2% below NUGGETS while being 58 times more efficient. Because it consumes far fewer resources, SuperNUGGETS has higher practical utility than the original NUGGETS.
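The one-shot selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (the fraction of predefined tasks whose loss drops when a candidate is prepended as a one-shot demonstration) is an assumption, and `loss_fn`, `one_shot_score`, `select_top`, and `toy_loss` are hypothetical names. The toy loss is a word-overlap stand-in for a small language model's loss, used only so the example runs without a model.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: loss_fn(task, demo) returns the scoring model's loss
# on `task`, with `demo` prepended as a one-shot example when it is not None.
LossFn = Callable[[str, Optional[str]], float]


def one_shot_score(candidate: str, tasks: Sequence[str], loss_fn: LossFn) -> float:
    """Fraction of tasks whose loss drops when `candidate` is the one-shot demo.

    Assumed scoring rule for illustration; the abstract does not spell it out.
    """
    if not tasks:
        raise ValueError("need at least one task")
    improved = sum(loss_fn(task, candidate) < loss_fn(task, None) for task in tasks)
    return improved / len(tasks)


def select_top(
    candidates: Sequence[str], tasks: Sequence[str], loss_fn: LossFn, keep: int
) -> List[str]:
    """Rank candidates by one-shot score (highest first) and keep the top `keep`."""
    ranked = sorted(
        candidates,
        key=lambda c: one_shot_score(c, tasks, loss_fn),
        reverse=True,
    )
    return ranked[:keep]


def toy_loss(task: str, demo: Optional[str] = None) -> float:
    """Toy stand-in for a model loss: shared words with the demo lower the loss."""
    base = float(len(task))
    if demo is None:
        return base
    overlap = len(set(task.lower().split()) & set(demo.lower().split()))
    return base - overlap
```

In a real pipeline, `loss_fn` would wrap a forward pass of the scoring model. Under this sketch, the efficiency gain reported in the abstract would come from making that scoring model a small one, since it is called once per candidate and task pair.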

Tasks

Language Modelling, Large Language Model, Small Language Model
