In-Context Probing Approximates Influence Function for Data Valuation

2024-07-17

Cathy Jiao, Gary Gao, Chenyan Xiong

Code

github.com/cxcscmu/InContextDataValuation
Official implementation (★ 1)

Abstract

Data valuation quantifies the value of training data. It is used for data attribution (i.e., determining the contribution of training data to model predictions) and for data selection, both of which are important for curating high-quality datasets to train large language models. In this paper, we show that data valuation through in-context probing (i.e., prompting an LLM) approximates influence functions for selecting training data. We provide a theoretical sketch of this connection, based on transformer models performing "implicit" gradient descent on their in-context inputs. Our empirical findings show that in-context probing and gradient-based influence frameworks rank training data similarly. Furthermore, fine-tuning experiments on data selected by either method yield similar model performance.
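To make the claimed connection concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a logistic-regression model in place of an LLM, a first-order influence score with the Hessian taken as the identity, and an explicit gradient step standing in for the "implicit" gradient descent attributed to in-context learning. All function names (`influence_score`, `one_step_score`, `ranking`) and the synthetic data are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Logistic loss for one example (x, y) with y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def grad(w, x, y):
    """Gradient of the logistic loss with respect to w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

def influence_score(w, train_ex, test_ex):
    """Assumed first-order influence: train/test gradient dot product
    (the inverse Hessian is replaced by the identity)."""
    g_tr = grad(w, *train_ex)
    g_te = grad(w, *test_ex)
    return sum(a * b for a, b in zip(g_tr, g_te))

def one_step_score(w, train_ex, test_ex, lr=0.01):
    """Stand-in for in-context probing: drop in test loss after one
    explicit gradient step on the training example."""
    g_tr = grad(w, *train_ex)
    w_new = [wi - lr * gi for wi, gi in zip(w, g_tr)]
    return loss(w, *test_ex) - loss(w_new, *test_ex)

def ranking(scores):
    """Indices of training examples, most valuable first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Synthetic data (hypothetical, for illustration only).
random.seed(0)
dim = 5
w = [random.gauss(0, 1) for _ in range(dim)]
train = [([random.gauss(0, 1) for _ in range(dim)], random.randint(0, 1))
         for _ in range(20)]
test_ex = ([random.gauss(0, 1) for _ in range(dim)], 1)

infl = [influence_score(w, ex, test_ex) for ex in train]
step = [one_step_score(w, ex, test_ex) for ex in train]
same_ranking = ranking(infl) == ranking(step)
print("rankings agree:", same_ranking)
```

In this toy setting both scores are monotone in the same quantity (the projection of the training gradient onto the test input), so they order the training examples identically. For an actual LLM the agreement is approximate at best, which is the empirical question the paper investigates.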

Tasks

Data Valuation
