GIFT: Reconciling Post-Training Objectives via Finite-Temperature Gibbs Initialization

2026-03-18

Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang

Code

github.com/zzy1127/gift
Official implementation, linked in the paper (5 stars)

Abstract

The prevailing post-training paradigm for Large Reasoning Models (LRMs), Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT to reconcile post-training objectives and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. In contrast, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that promotes objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when used for RL initialization, providing a mathematically principled pathway to preserve exploration and align the two post-training stages. Our code is available at https://github.com/zzy1127/GIFT.
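
The contrast between a zero-temperature limit and a finite-temperature potential can be illustrated with a toy example. The sketch below is not taken from the paper or its repository: the function names, the three-token vocabulary, and the 0/1 energy are all assumptions chosen for illustration. It reweights a base distribution by exp(-energy / temperature), a generic Gibbs form, and compares a near-zero temperature with a finite one.

```python
import math


def gibbs_tilt(base_probs, target_index, temperature):
    """Reweight a base distribution by exp(-energy / temperature).

    Assumed toy energy (hypothetical, not from the paper):
    0 for the supervised target token, 1 for every other token.
    """
    weights = []
    for i, p in enumerate(base_probs):
        energy = 0.0 if i == target_index else 1.0
        weights.append(p * math.exp(-energy / temperature))
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical base prior over a three-token vocabulary.
base = [0.5, 0.3, 0.2]
target = 1

# Near-zero temperature: almost all mass moves to the supervised token.
near_zero = gibbs_tilt(base, target, temperature=0.01)

# Finite temperature: the target is favored, but the base prior
# still shapes how the remaining mass is spread.
finite = gibbs_tilt(base, target, temperature=1.0)

print("near-zero T:", near_zero, "entropy:", entropy(near_zero))
print("finite T:   ", finite, "entropy:", entropy(finite))
```

In this toy setting the near-zero-temperature distribution is effectively one-hot on the supervised token, while the finite-temperature one keeps nonzero entropy and preserves the base prior's relative ordering among the non-target tokens. That is the kind of retained exploration the abstract attributes to a finite-temperature initialization, though the paper's actual objective may differ from this form.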
