Greedy Information Projection for LLM Data Selection

2026-03-14

Victor Ye Dong, Kuan-Yun Lee, Jiamei Shuai, Shengfei Liu, Yi Liu, Jian Jiao

Abstract

We present Greedy Information Projection (GIP), a principled framework for choosing training examples for large language model fine-tuning. GIP casts selection as maximizing mutual information between a subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. The framework involves optimizing a closed-form mutual information objective defined using both data and query embeddings, naturally balancing quality and diversity. Optimizing this score is equivalent to maximizing the projection of the query embedding matrix onto the span of the selected data, which provides a geometric explanation for the co-emergence of quality and diversity. Building on this view, we employ a fast greedy matching-pursuit procedure with efficient projection-based updates. On instruction-following and mathematical reasoning datasets, GIP selects small subsets that match full-data fine-tuning while using only a fraction of examples and compute, unifying quality-aware and diversity-aware selection for efficient fine-tuning.
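The geometric view described above, greedily growing a subset whose span captures as much of the query embeddings as possible, can be illustrated with a small sketch. This is not the authors' implementation: it assumes the score is the squared Frobenius norm of the queries projected onto the span of the selected rows, used here as a stand-in for the paper's closed-form mutual information objective, and it uses plain Gram-Schmidt deflation for the projection-based updates. The function name `greedy_projection_select`, its parameters, and the toy embeddings are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def greedy_projection_select(data, queries, k, tol=1e-12):
    """Greedily pick up to k rows of `data` whose span captures the most
    squared norm of the rows of `queries`.

    Assumption: this projection score stands in for the paper's
    mutual information objective, which is not reproduced here.
    """
    # residuals[i] is the part of data[i] orthogonal to the span selected so far
    residuals = [list(row) for row in data]
    selected, gains = [], []
    for _ in range(k):
        best_i, best_gain = None, tol
        for i, r in enumerate(residuals):
            norm_sq = dot(r, r)
            if i in selected or norm_sq <= tol:
                continue
            # Extra query energy captured by adding the direction r / ||r||
            gain = sum(dot(q, r) ** 2 for q in queries) / norm_sq
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no remaining candidate adds anything
            break
        r = residuals[best_i]
        norm = math.sqrt(dot(r, r))
        u = [x / norm for x in r]
        # Projection-based update: deflate every residual against the new direction
        for j, rj in enumerate(residuals):
            c = dot(rj, u)
            residuals[j] = [x - c * y for x, y in zip(rj, u)]
        selected.append(best_i)
        gains.append(best_gain)
    return selected, gains


# Toy embeddings (hypothetical): row 1 is a near-duplicate of row 0.
data = [[1.0, 0.0, 0.0], [0.9, 0.0, 0.1], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
selected, gains = greedy_projection_select(data, queries, k=2)
print(selected, gains)
```

In this toy case row 1 scores almost as well as row 0 in the first round, but once row 0 is chosen its residual carries no remaining query energy, so the second pick goes to row 2. That is the sense in which a projection objective can reward quality (alignment with the queries) and diversity (new directions) at the same time.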
