
Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

2024-02-02

Dan Qiao, Yu-Xiang Wang


Abstract

We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -- a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of Õ(√(H³S²ABK)), while the batch complexity is only O(H + log log K). In the above, S denotes the number of states, A and B are the numbers of actions for the two players respectively, H is the horizon and K is the number of episodes. Furthermore, we prove a batch complexity lower bound Ω(H/log_A(K) + log log K) for all algorithms with an Õ(√K) regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near-optimal batch complexity. To the best of our knowledge, these are the first results towards understanding MARL with low adaptivity.
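The O(log log K) term in the batch complexity is characteristic of doubling-exponent batch schedules used by low-adaptivity elimination algorithms: batch i ends around episode K^(1 - 2^(-i)), so roughly log₂ log₂ K batches suffice to cover all K episodes. The sketch below illustrates that schedule; it is a generic construction for intuition, not necessarily the exact schedule used in the paper.

```python
import math

def batch_schedule(K, M=None):
    """Illustrative doubling-exponent batch schedule.

    Batch i (1-indexed) ends at episode ceil(K^(1 - 2^-i)); the final
    batch is cut off at K. With M = O(log log K) batches the schedule
    covers all K episodes, which is how low-switching algorithms keep
    the number of policy updates doubly logarithmic in K.
    (Generic construction, not the paper's exact schedule.)
    """
    if M is None:
        # Number of batches grows as log2(log2(K)), plus one final batch.
        M = max(2, math.ceil(math.log2(math.log2(K))) + 1)
    ends = [math.ceil(K ** (1.0 - 2.0 ** (-i))) for i in range(1, M)]
    ends.append(K)  # last batch runs to the final episode
    return ends

if __name__ == "__main__":
    K = 10**6
    sched = batch_schedule(K)
    # Only a handful of batches are needed even for a million episodes.
    print(len(sched), sched)
```

Note how the batch endpoints grow super-geometrically: each batch is roughly the square root of the remaining gap to K, which is why only doubly-logarithmically many batches are needed.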
