Provably Efficient Reinforcement Learning with Aggregated States
Shi Dong, Benjamin Van Roy, Zhengyuan Zhou
Abstract
We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon H K)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state. Notably, this regret bound does not depend on the number of states or actions and indicates that asymptotic per-period regret is no greater than $\epsilon$, independent of horizon. To our knowledge, this is the first such result that applies to reinforcement learning with nontrivial value function approximation without any restrictions on transition probabilities.
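To make the setting concrete, here is a minimal sketch of what one episode of optimistic Q-learning over aggregate states might look like. This is not the paper's exact algorithm: the environment interface (`reset`/`step`), the aggregation map `phi`, the learning rate, and the bonus constants are illustrative assumptions, with the learning rate and bonus shaped after the UCB-style Q-learning analysis of Jin et al. (2018) that this line of work builds on.

```python
import numpy as np

def run_episode(env, Q, N, H, M, num_actions, phi, c=1.0, delta=0.05):
    """One episode of optimistic Q-learning over aggregate states (sketch).

    Q   : (H, M, num_actions) array of optimistic value estimates,
          initialized to H (optimistic initialization)
    N   : (H, M, num_actions) array of visit counts
    phi : maps a raw environment state to one of M aggregate states
    env : assumed gym-style interface with reset() and step(a) -> (s, r, done)
    """
    s = env.reset()
    for h in range(H):
        m = phi(s)                      # aggregate the raw state
        a = int(np.argmax(Q[h, m]))     # act greedily w.r.t. optimistic Q
        s_next, r, done = env.step(a)

        N[h, m, a] += 1
        t = N[h, m, a]
        alpha = (H + 1) / (H + t)       # learning rate from Jin et al. (2018)
        # Illustrative UCB-style exploration bonus; the paper's constants differ.
        bonus = c * np.sqrt(H**3 * np.log(M * num_actions * H / delta) / t)

        # Optimistic target: reward plus clipped next-step value plus bonus.
        v_next = 0.0 if h == H - 1 else min(H, np.max(Q[h + 1, phi(s_next)]))
        Q[h, m, a] = (1 - alpha) * Q[h, m, a] + alpha * (r + v_next + bonus)

        if done:
            break
        s = s_next
    return Q, N
```

The key point reflected in the sketch is that all estimates are indexed by the $M$ aggregate states rather than the raw states, which is why the regret bound above depends on $M$ but not on the sizes of the underlying state and action spaces.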