
Variational Bayesian Reinforcement Learning with Regret Bounds

2018-07-25 · NeurIPS 2021

Brendan O'Donoghue


Abstract

In reinforcement learning the Q-values summarize the expected future rewards that the agent will attain. However, they cannot capture the epistemic uncertainty about those rewards. In this work we derive a new Bellman operator whose associated fixed point we call the 'knowledge values'. These K-values compress both the expected future rewards and the epistemic uncertainty into a single value, so that high uncertainty, high reward, or both, can yield high K-values. The key principle is to endow the agent with a risk-seeking utility function that is carefully tuned to balance exploration and exploitation. When the agent follows a Boltzmann policy over the K-values it yields a Bayes regret bound of Õ(L^{3/2} √(S A T)), where L is the time horizon, S is the number of states, A is the number of actions, and T is the total number of elapsed timesteps. We show deep connections of this approach to the soft-max and maximum-entropy strands of research in reinforcement learning.
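The abstract sketches the mechanism: a risk-seeking (exponential) utility turns posterior uncertainty into an exploration bonus inside a Bellman backup, and the agent then acts via a Boltzmann policy over the resulting K-values. Below is a minimal numpy sketch of that idea, not the paper's exact algorithm: it assumes a tabular finite-horizon MDP with known transitions and an independent Gaussian posterior over rewards, for which tau * log E[exp(r/tau)] = mu + var/(2*tau), which is where the uncertainty bonus comes from. All function names, shapes, and the fixed temperature are illustrative assumptions.

```python
import numpy as np

def k_value_backup(mu_r, var_r, P, tau, L):
    """Finite-horizon K-value backup (illustrative sketch).

    mu_r  : (S, A) posterior mean of the rewards
    var_r : (S, A) posterior variance of the rewards (epistemic uncertainty)
    P     : (S, A, S) transition probabilities (assumed known here)
    tau   : temperature of the risk-seeking exponential utility
    L     : time horizon
    Returns K-values of shape (L, S, A).
    """
    S, A = mu_r.shape
    K = np.zeros((L, S, A))
    V = np.zeros(S)  # value beyond the horizon is zero
    for h in reversed(range(L)):
        # Certainty-equivalent of a Gaussian reward under exponential utility:
        # tau * log E[exp(r / tau)] = mu + var / (2 * tau),
        # so posterior variance enters the backup as an exploration bonus.
        K[h] = mu_r + var_r / (2.0 * tau) + P @ V
        # Soft-max (log-sum-exp) value, consistent with the Boltzmann policy.
        V = tau * np.logaddexp.reduce(K[h] / tau, axis=1)
    return K

def boltzmann_policy(K_h, tau):
    """Boltzmann policy over K-values: pi(a|s) proportional to exp(K(s,a)/tau)."""
    logits = K_h / tau
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Usage on a toy 3-state, 2-action problem:
rng = np.random.default_rng(0)
S, A, L, tau = 3, 2, 5, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
mu_r = rng.normal(size=(S, A))               # posterior reward means
var_r = rng.uniform(0.1, 1.0, size=(S, A))   # posterior reward variances
K = k_value_backup(mu_r, var_r, P, tau, L)
print(boltzmann_policy(K[0], tau))           # action distribution per state
```

Note that the soft-max (log-sum-exp) value inside the backup matches the Boltzmann policy used to act, which is the connection to the soft-max and maximum-entropy strands of RL that the abstract alludes to.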
