SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin
Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu
Abstract
Recently, enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has emerged as a research hotspot. Existing methods face several limitations: inference-phase techniques (e.g., Chain of Thought) rely on prompt selection and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle with step-wise mathematical correctness and depend on distillation from stronger models or human annotations; and Reinforcement Learning (RL) approaches incur high GPU memory costs and unstable training. To address these issues, we propose a Self-training framework integrating Process Preference learning using Dynamic value margin (SPPD). SPPD leverages a process-based Markov Decision Process (MDP) and the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization, and employs tree-based self-sampling of model responses without any distillation from other models. Furthermore, we theoretically prove that SPPD is equivalent to on-policy policy gradient methods under reward constraints. Experiments on 7B-scale models demonstrate superior performance on both in-domain and out-of-domain mathematical benchmarks. We open-source our code at https://anonymous.4open.science/r/SPPD-DCDD.
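To make the abstract's central object concrete, below is a minimal PyTorch sketch of a step-level, DPO-style preference loss with an additive per-pair margin. This is an illustration under stated assumptions, not the paper's exact formulation: the function name `step_dpo_loss_with_margin`, its tensor arguments, and the way the margin enters the Bradley-Terry logit are all hypothetical, chosen only to show how a dynamic value margin could modify a standard step-level DPO objective.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss_with_margin(
    pi_chosen_logps: torch.Tensor,    # log pi_theta(chosen step | prefix), shape (batch,)
    pi_rejected_logps: torch.Tensor,  # log pi_theta(rejected step | prefix)
    ref_chosen_logps: torch.Tensor,   # same steps scored by the frozen reference policy
    ref_rejected_logps: torch.Tensor,
    value_margin: torch.Tensor,       # hypothetical per-pair margin, e.g. an estimated
                                      # value gap between steps (dynamic, not a constant)
    beta: float = 0.1,                # standard DPO inverse-temperature
) -> torch.Tensor:
    """DPO-style step-level preference loss with an additive margin.

    The margin shifts the Bradley-Terry logit, so pairs with a larger
    estimated value gap must be separated more strongly before the
    loss saturates.
    """
    chosen_rewards = beta * (pi_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (pi_rejected_logps - ref_rejected_logps)
    logits = chosen_rewards - rejected_rewards - value_margin
    return -F.logsigmoid(logits).mean()
```

In this reading, a fixed `value_margin` would recover an ordinary margin-DPO loss; making it vary per step pair (here, driven by value estimates) is what "dynamic" would mean in such a sketch. Consult the released code for the authors' actual derivation from the Bellman optimality equation.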