
Offline and Online KL-Regularized RLHF under Differential Privacy

2025-10-15

Yulian Wu, Rushil Thareja, Praneeth Vepakomma, Francesco Orabona

Abstract

In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective in large language model alignment -- under the ε-local differential privacy (ε-LDP) model on the labels of human preferences. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of O(1/((e^ε-1)^2 n)) on the KL-regularized objective under single-policy concentrability, where n is the sample size. We also prove its optimality by providing a matching lower bound. In the online setting, we are the first to theoretically investigate KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of O(d_F log(N_F T)/(e^ε-1)^2), where T is the total number of time steps, N_F is the cardinality of the reward function space F, and d_F is a variant of the eluder dimension for RLHF. As a by-product of our analysis, our results also imply the first analysis of online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open-source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.
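The bounds above depend on the privacy level through a 1/(e^ε-1)^2 factor, which typically arises when each binary preference label is privatized locally and the learner debiases the noisy responses. Below is a minimal sketch of such a label-privatization step, assuming the standard binary randomized-response mechanism (the canonical ε-LDP channel for binary labels); the function names and the NumPy-based debiasing helper are illustrative only and are not taken from the paper's implementation.

```python
import numpy as np


def randomized_response(label: int, epsilon: float, rng: np.random.Generator) -> int:
    """Privatize a binary preference label under epsilon-LDP.

    The true label is reported with probability e^eps / (1 + e^eps)
    and flipped otherwise, which satisfies epsilon-local differential privacy.
    """
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return label if rng.random() < p_keep else 1 - label


def debias(private_labels: np.ndarray, epsilon: float) -> np.ndarray:
    """Unbiased per-label estimates recovered from privatized labels.

    Inverting the randomized-response channel divides by
    2 * p_keep - 1 = (e^eps - 1) / (e^eps + 1), which inflates the variance
    and is the source of the (e^eps - 1)^{-2} dependence in the bounds.
    """
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (private_labels - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)


# Usage: privatize n = 1000 preference labels at epsilon = 1.0 and check the
# debiased mean against the true label mean.
rng = np.random.default_rng(0)
true_labels = rng.integers(0, 2, size=1000)
eps = 1.0
priv = np.array([randomized_response(int(y), eps, rng) for y in true_labels])
print(abs(debias(priv, eps).mean() - true_labels.mean()))
```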
