
Breaking the Bias Barrier in Concave Multi-Objective Reinforcement Learning

2026-03-09

Swetha Ganesh, Vaneet Aggarwal


Abstract

While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility f(J_1^π, …, J_M^π) over multiple objectives, where each J_m^π denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on ∇f(J^π), while in practice only empirical return estimates Ĵ are available. Because f is nonlinear, the plug-in estimator is biased (E[f(Ĵ)] ≠ f(E[Ĵ])), leading to persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods suffer an intrinsic O(ε^-4) sample complexity due to this bias. To address this issue, we develop a Natural Policy Gradient (NPG) algorithm equipped with a multi-level Monte Carlo (MLMC) estimator that controls the bias of the scalarization gradient while maintaining low sampling cost. We prove that this approach achieves the optimal O(ε^-2) sample complexity for computing an ε-optimal policy. Furthermore, we show that when the scalarization function is second-order smooth, the first-order bias cancels automatically, allowing vanilla NPG to achieve the same O(ε^-2) rate without MLMC. Our results provide the first optimal sample complexity guarantees for concave multi-objective reinforcement learning under policy-gradient methods.
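The plug-in bias the abstract describes is just Jensen's inequality at work: for a concave f, E[f(Ĵ)] < f(E[J]) whenever Ĵ is a noisy estimate of the true return vector. A minimal sketch (not the paper's algorithm; the utility f = sqrt, the return mean, and all batch sizes here are illustrative assumptions) shows the bias and its decay with batch size:

```python
import math
import random

# Hypothetical demo of the plug-in bias: for a concave utility f,
# E[f(J_hat)] < f(E[J]) by Jensen's inequality, and the gap shrinks
# roughly like O(1/batch_size). None of these numbers come from the paper.

random.seed(0)

def f(x):
    # A simple concave scalarization (stand-in for fairness-style utilities).
    return math.sqrt(x)

MU = 4.0     # assumed true expected return E[J]
SIGMA = 2.0  # assumed per-sample return noise

def plug_in_estimate(batch_size, reps=20000):
    # Average f(J_hat) over many independent batches to approximate E[f(J_hat)].
    total = 0.0
    for _ in range(reps):
        j_hat = sum(random.gauss(MU, SIGMA) for _ in range(batch_size)) / batch_size
        total += f(max(j_hat, 0.0))  # clamp so sqrt stays defined
    return total / reps

true_value = f(MU)           # f(E[J]) = 2.0
small = plug_in_estimate(4)  # small batch: noticeable downward bias
large = plug_in_estimate(64) # larger batch: bias shrinks, but never vanishes

print(true_value, small, large)
```

A second-order Taylor expansion gives the bias as roughly f''(μ)·Var(Ĵ)/2, which is why naive plug-in gradients pay for extra samples per step while MLMC-style estimators can cancel the leading bias term at low cost.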
