
Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

2026-03-20

Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami


Abstract

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically multi-source (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from multi-source imperfect preferences through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most ω over K episodes. We propose a unified algorithm with regret O(√(K/M) + ω), which exhibits a best-of-both-regimes behavior: it achieves M-dependent statistical gains when imperfection is small (where M is the number of sources), while remaining robust, with an unavoidable additive dependence on ω, when imperfection is large. We complement this with a lower bound Ω(max{√(K/M), ω}), which captures both the best possible improvement with respect to M and the unavoidable dependence on ω, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as Ω(min{ωK, K}). Technically, our approach combines imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
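To make the weighted comparison-learning ingredient more concrete, below is a minimal Python sketch, not the paper's algorithm: a Bradley-Terry-style preference loss aggregated over M sources with per-source weights, where a source that deviates from the oracle is down-weighted. The linear reward model, the fixed weights, and the synthetic data are illustrative assumptions; in the paper's setting the weights are adapted to each source's imperfection rather than fixed by hand.

```python
# Illustrative sketch (not the paper's algorithm): weighted comparison learning
# across M preference sources under a Bradley-Terry-style model with a linear
# reward. Per-source weights stand in for the "imperfection-adaptive" weighting
# idea; the weights, reward model, and data below are all assumptions.
import numpy as np

rng = np.random.default_rng(0)


def weighted_preference_grad(theta, feats_a, feats_b, labels, weights):
    """Gradient of the weighted Bradley-Terry (logistic) preference loss.

    feats_a, feats_b : (N, d) features of the two trajectories in each comparison.
    labels           : (M, N) 0/1 preferences reported by each of the M sources.
    weights          : (M,) per-source weights (larger = trusted more).
    """
    x = feats_a - feats_b                               # (N, d) feature differences
    p = 1.0 / (1.0 + np.exp(-(x @ theta)))              # model's P(a preferred over b)
    resid = weights[:, None] * (p[None, :] - labels)    # (M, N) weighted residuals
    return resid.sum(axis=0) @ x / (weights.sum() * labels.shape[1])


# Tiny synthetic demo: 3 sources, one of which deviates systematically.
d, N, M = 4, 200, 3
theta_true = rng.normal(size=d)
feats_a = rng.normal(size=(N, d))
feats_b = rng.normal(size=(N, d))
p_true = 1.0 / (1.0 + np.exp(-((feats_a - feats_b) @ theta_true)))
labels = (rng.random((M, N)) < p_true).astype(float)    # oracle-consistent labels
labels[2] = 1.0 - labels[2]                             # source 2 deviates from the oracle

weights = np.array([1.0, 1.0, 0.1])                     # hypothetical down-weighting of the bad source
theta = np.zeros(d)
for _ in range(500):
    theta -= 0.5 * weighted_preference_grad(theta, feats_a, feats_b, labels, weights)

cos = theta @ theta_true / (np.linalg.norm(theta) * np.linalg.norm(theta_true) + 1e-12)
print(f"cosine similarity to the true reward direction: {cos:.3f}")
```

With equal weights the flipped source drags the estimate toward a wrong reward direction; down-weighting it recovers a direction close to the ground truth, which is the intuition behind trading off M-fold statistical gains against the imperfection budget ω.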
