
Post-Training with Policy Gradients: Optimality and the Base Model Barrier

2026-03-07

Alireza Mousavi-Hosseini, Murat A. Erdogdu


Abstract

We study post-training linear autoregressive models with outcome and process rewards. Given a context x, the model must predict the response y ∈ Y^N, a sequence of length N that satisfies a γ-margin condition, an extension of standard separability to sequences. We prove that on test samples where the base model achieves a non-trivial likelihood α, a variant of policy gradient (PG) can achieve likelihood 1 - ε with an essentially minimax optimal number of reward queries O((α^-1 + ε^-1)/γ^2). However, a barrier arises for going beyond the support of the base model. We prove that the overall expected error after post-training with outcome rewards is governed by a property of the base model called the Likelihood Quantile (LQ), and that variants of PG, while minimax optimal, may require a number of reward queries exponential in N to go beyond this support, regardless of the pre-training algorithm. To overcome this barrier, we study post-training with a process reward model, and demonstrate how PG variants in this setting avoid the curse of dimensionality in N via dependence on a token-level LQ. Along the way, we prove that under the margin condition, SGD with adaptive learning rate (LR) achieves a near optimal test error for statistical learning, and PG with adaptive LR achieves a near optimal number of mistakes for online learning while being computationally efficient whenever possible, both of which may be of independent interest.
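To make the outcome-reward setting concrete, the following is a minimal sketch, assuming a toy linear softmax policy, a binary outcome reward, and a plain REINFORCE-style update. The dimensions, the toy reward rule, and the shared (non-conditioned) linear head are illustrative assumptions, not the paper's construction or its PG variant.

```python
import numpy as np

# Illustrative sketch of policy-gradient post-training with a binary outcome
# reward on a simplified "linear autoregressive" model. All sizes and the
# reward rule below are toy assumptions for demonstration only.

rng = np.random.default_rng(0)
d, V, N = 8, 5, 4                          # context dim, vocab size |Y|, sequence length N
W = rng.normal(scale=0.1, size=(V, d))     # linear head playing the role of the base model

def sample_response(W, x):
    """Sample y_1..y_N; simplification: tokens share one unconditioned linear head."""
    logits = W @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(V, size=N, p=probs), probs

def outcome_reward(x, y):
    """Binary outcome reward: 1 iff the whole sequence matches a toy target rule."""
    target = np.abs(x[:N] * 10).astype(int) % V
    return float(np.all(y == target))

lr = 0.5                                   # fixed LR here; the paper analyzes adaptive LR
for step in range(200):
    x = rng.normal(size=d)                 # draw a context
    y, probs = sample_response(W, x)       # sample a response from the current policy
    r = outcome_reward(x, y)               # one outcome-reward query per sampled response
    # REINFORCE update: r * sum_t grad_W log pi(y_t | x)
    grad = np.zeros_like(W)
    for t in range(N):
        one_hot = np.zeros(V)
        one_hot[y[t]] = 1.0
        grad += np.outer(one_hot - probs, x)
    W += lr * r * grad
```

With a binary reward, this update only moves the policy on sampled responses the reward marks as correct, which is why, as the abstract notes, progress hinges on the base model already placing non-trivial likelihood α on such responses.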
