
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs

2024-10-10 · Code Available

Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause

Abstract

Recent efforts in fine-tuning language models often rely on automatic data selection, commonly using Nearest Neighbors retrieval from large datasets. However, we theoretically show that this approach tends to select redundant data, limiting its effectiveness or even hurting performance. To address this, we introduce SIFT, a data selection algorithm designed to reduce uncertainty about the model's response given a prompt, which unifies ideas from retrieval and active learning. Whereas Nearest Neighbor retrieval typically fails in the presence of information duplication, SIFT accounts for such duplication and optimizes the overall information gain of the selected examples. We focus our evaluations on fine-tuning at test-time for prompt-specific language modeling on the Pile dataset, and show that SIFT consistently outperforms Nearest Neighbor retrieval with minimal computational overhead. Moreover, we show that our uncertainty estimates can predict the performance gain of test-time fine-tuning, and use this to develop an adaptive algorithm that invests test-time compute proportional to realized performance gains. We provide the activeft (Active Fine-Tuning) library, which can be used as a drop-in replacement for Nearest Neighbor retrieval.
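The contrast the abstract draws can be made concrete with a small sketch. Below, Nearest Neighbor retrieval scores each candidate independently, so near-duplicates of a relevant example are all selected. A greedy uncertainty-reduction rule in the spirit of SIFT (this is an illustrative toy under a linear-kernel Gaussian model, not the activeft API) instead picks the point that most reduces posterior variance about the prompt, which automatically discounts information it has already seen.

```python
import numpy as np

def nn_select(prompt, data, k):
    """Nearest Neighbor retrieval: top-k by inner-product similarity.
    Each candidate is scored independently, so duplicates are not penalized."""
    sims = data @ prompt
    return list(np.argsort(-sims)[:k])

def sift_like_select(prompt, data, k, noise=0.1):
    """Greedy uncertainty reduction (toy sketch of the SIFT idea):
    at each step, add the point that most reduces the posterior variance
    of the prompt under a linear-kernel Gaussian model with observation
    noise `noise`. Redundant points yield little extra variance reduction."""
    n = len(data)
    selected = []
    for _ in range(k):
        best, best_var = None, np.inf
        for i in range(n):
            if i in selected:
                continue
            S = selected + [i]
            X = data[S]
            K = X @ X.T + noise * np.eye(len(S))   # kernel matrix of observed points
            kxp = X @ prompt                        # covariances with the prompt
            # posterior variance of the prompt after observing S
            var = prompt @ prompt - kxp @ np.linalg.solve(K, kxp)
            if var < best_var:
                best, best_var = i, var
        selected.append(best)
    return selected
```

On a dataset containing a duplicated example, `nn_select` picks the duplicate pair, while `sift_like_select` picks one copy plus a complementary example that covers the rest of the prompt direction.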

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| The Pile | Test-Time Fine-Tuning with SIFT + Llama-3.2 (3B) | Bits per byte | 0.56 | — | Unverified |
| The Pile | Test-Time Fine-Tuning with SIFT + Phi-3 (3.8B) | Bits per byte | 0.60 | — | Unverified |
| The Pile | Test-Time Fine-Tuning with SIFT + Llama-3.2 (1B) | Bits per byte | 0.61 | — | Unverified |
| The Pile | Gemma-2 27B | Bits per byte | 0.63 | — | Unverified |
| The Pile | Llama-3.2 3B | Bits per byte | 0.64 | — | Unverified |
| The Pile | Phi-3 14B | Bits per byte | 0.65 | — | Unverified |
| The Pile | Gemma-2 9B | Bits per byte | 0.67 | — | Unverified |
| The Pile | Phi-3 7B | Bits per byte | 0.68 | — | Unverified |
| The Pile | Phi-3 3.8B | Bits per byte | 0.68 | — | Unverified |
| The Pile | Llama-3.2 1B | Bits per byte | 0.70 | — | Unverified |
| The Pile | Gemma-2 2B | Bits per byte | 0.72 | — | Unverified |
| The Pile | Llama-3.2-Instruct 3B | Bits per byte | 0.74 | — | Unverified |
| The Pile | Test-Time Fine-Tuning with SIFT + GPT-2 (774M) | Bits per byte | 0.76 | — | Unverified |
| The Pile | Llama-3.2-Instruct 1B | Bits per byte | 0.81 | — | Unverified |
| The Pile | Test-Time Fine-Tuning with SIFT + GPT-2 (124M) | Bits per byte | 0.86 | — | Unverified |
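The metric in the table, bits per byte, is a tokenizer-independent way to compare language models of different vocabulary sizes: total cross-entropy in bits divided by the number of raw bytes of text. A minimal conversion from the usual per-token cross-entropy (in nats) looks like this; the function name and arguments are illustrative, not from the paper's code.

```python
import math

def bits_per_byte(nats_per_token, n_tokens, n_bytes):
    """Convert average cross-entropy (nats/token) into bits per byte.
    Lower is better; the metric is comparable across tokenizers because
    it normalizes by raw bytes rather than by tokens."""
    total_nats = nats_per_token * n_tokens
    return total_nats / (math.log(2) * n_bytes)
```

For example, a model achieving ln 2 nats/token (exactly 1 bit/token) on text where each token covers one byte scores 1.0 bits per byte.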

Reproductions