Increasing the Action Gap: New Operators for Reinforcement Learning
Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, Rémi Munos
Code
- github.com/janhuenermann/neurojs
- github.com/chainer/chainerrl
Abstract
This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.
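To make the notion of an action gap concrete, here is a minimal tabular sketch in Python. The abstract does not state the operators themselves, so the forms below are taken as assumptions from the usual presentation of this work: the consistent Bellman operator replaces the max over next-state actions with Q(x, a) on self-transitions, and advantage learning subtracts alpha * (V(x) - Q(x, a)) from the standard Bellman backup. The two-state MDP, the function names, and the choice alpha = 0.5 are hypothetical and purely illustrative.

```python
import numpy as np

def bellman(Q, P, R, gamma):
    """Standard Bellman optimality backup: R + gamma * E[max_b Q(x', b)]."""
    V = Q.max(axis=1)
    return R + gamma * (P @ V)

def consistent_bellman(Q, P, R, gamma):
    """Assumed form of the consistent Bellman operator: on a self-transition
    (x' == x), back up Q(x, a) instead of max_b Q(x, b)."""
    idx = np.arange(Q.shape[0])
    p_stay = P[idx, :, idx]                # p_stay[x, a] = P(x | x, a)
    V = Q.max(axis=1, keepdims=True)
    return bellman(Q, P, R, gamma) - gamma * p_stay * (V - Q)

def advantage_learning(Q, P, R, gamma, alpha=0.5):
    """Assumed form of the advantage learning operator:
    Bellman backup minus alpha * (V(x) - Q(x, a))."""
    V = Q.max(axis=1, keepdims=True)
    return bellman(Q, P, R, gamma) - alpha * (V - Q)

def iterate(op, P, R, gamma, n_iters=1000, **kwargs):
    """Apply an operator repeatedly, starting from Q = 0."""
    Q = np.zeros_like(R)
    for _ in range(n_iters):
        Q = op(Q, P, R, gamma, **kwargs)
    return Q

def action_gap(Q):
    """Difference between the best and second-best action value per state."""
    top2 = np.sort(Q, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

# Hypothetical deterministic MDP: P[x, a, x'] is the transition probability.
# Action 0 always leads to state 0, action 1 always leads to state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 0] = 1.0
P[0, 1, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])                 # R[x, a]
gamma = 0.9

Q_star = iterate(bellman, P, R, gamma)
Q_c = iterate(consistent_bellman, P, R, gamma)
Q_al = iterate(advantage_learning, P, R, gamma, alpha=0.5)

for name, Q in [("Bellman", Q_star), ("consistent", Q_c), ("AL", Q_al)]:
    print(name, "greedy:", Q.argmax(axis=1), "gap:", action_gap(Q))
```

On this toy problem all three operators yield the same greedy policy and the same state values, while the gaps differ: roughly 1.9 and 2.9 for the Bellman operator, 19 and 2.9 for the consistent operator (only state 0 has a suboptimal self-loop), and 3.8 and 5.8 for advantage learning, which scales each gap by 1 / (1 - alpha).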
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Atari 2600 Alien | Persistent AL | Score | 5,699.81 | — | Unverified |
| Atari 2600 Alien | Advantage Learning | Score | 4,990.91 | — | Unverified |
| Atari 2600 Amidar | Persistent AL | Score | 1,451.65 | — | Unverified |
| Atari 2600 Amidar | Advantage Learning | Score | 1,557.43 | — | Unverified |
| Atari 2600 Assault | Persistent AL | Score | 3,304.33 | — | Unverified |
| Atari 2600 Assault | Advantage Learning | Score | 3,661.51 | — | Unverified |
| Atari 2600 Asterix | Advantage Learning | Score | 12,852.08 | — | Unverified |
| Atari 2600 Asterix | Persistent AL | Score | 19,564.9 | — | Unverified |
| Atari 2600 Asteroids | Persistent AL | Score | 1,673.52 | — | Unverified |
| Atari 2600 Asteroids | Advantage Learning | Score | 1,924.42 | — | Unverified |
| Atari 2600 Atlantis | Advantage Learning | Score | 553,591.67 | — | Unverified |
| Atari 2600 Atlantis | Persistent AL | Score | 1,465,250 | — | Unverified |
| Atari 2600 Bank Heist | Advantage Learning | Score | 633.63 | — | Unverified |
| Atari 2600 Bank Heist | Persistent AL | Score | 874.99 | — | Unverified |
| Atari 2600 Battle Zone | Advantage Learning | Score | 28,789.29 | — | Unverified |
| Atari 2600 Battle Zone | Persistent AL | Score | 34,583.07 | — | Unverified |
| Atari 2600 Beam Rider | Persistent AL | Score | 13,145.34 | — | Unverified |
| Atari 2600 Beam Rider | Advantage Learning | Score | 10,054.58 | — | Unverified |
| Atari 2600 Berzerk | Persistent AL | Score | 1,328.25 | — | Unverified |
| Atari 2600 Berzerk | Advantage Learning | Score | 747.26 | — | Unverified |
| Atari 2600 Bowling | Advantage Learning | Score | 57.41 | — | Unverified |
| Atari 2600 Bowling | Persistent AL | Score | 71.59 | — | Unverified |
| Atari 2600 Boxing | Persistent AL | Score | 94.3 | — | Unverified |
| Atari 2600 Boxing | Advantage Learning | Score | 93.94 | — | Unverified |
| Atari 2600 Breakout | Advantage Learning | Score | 425.32 | — | Unverified |
| Atari 2600 Breakout | Persistent AL | Score | 431.89 | — | Unverified |
| Atari 2600 Centipede | Persistent AL | Score | 4,539.55 | — | Unverified |
| Atari 2600 Centipede | Advantage Learning | Score | 4,225.18 | — | Unverified |
| Atari 2600 Chopper Command | Advantage Learning | Score | 5,431.36 | — | Unverified |
| Atari 2600 Chopper Command | Persistent AL | Score | 5,734.93 | — | Unverified |
| Atari 2600 Crazy Climber | Advantage Learning | Score | 123,410.71 | — | Unverified |
| Atari 2600 Crazy Climber | Persistent AL | Score | 130,002.71 | — | Unverified |
| Atari 2600 Defender | Advantage Learning | Score | 30,643.59 | — | Unverified |
| Atari 2600 Defender | Persistent AL | Score | 32,038.93 | — | Unverified |
| Atari 2600 Demon Attack | Advantage Learning | Score | 27,153.48 | — | Unverified |
| Atari 2600 Demon Attack | Persistent AL | Score | 70,908.17 | — | Unverified |
| Atari 2600 Double Dunk | Persistent AL | Score | -2.51 | — | Unverified |
| Atari 2600 Double Dunk | Advantage Learning | Score | -0.15 | — | Unverified |
| Atari 2600 Elevator Action | Persistent AL | Score | 29,100 | — | Unverified |
| Atari 2600 Elevator Action | Advantage Learning | Score | 27,088.89 | — | Unverified |
| Atari 2600 Enduro | Advantage Learning | Score | 1,252.7 | — | Unverified |
| Atari 2600 Enduro | Persistent AL | Score | 1,343.1 | — | Unverified |
| Atari 2600 Fishing Derby | Persistent AL | Score | 28.13 | — | Unverified |
| Atari 2600 Fishing Derby | Advantage Learning | Score | 21.32 | — | Unverified |
| Atari 2600 Freeway | Persistent AL | Score | 32.3 | — | Unverified |
| Atari 2600 Freeway | Advantage Learning | Score | 31.72 | — | Unverified |
| Atari 2600 Frostbite | Advantage Learning | Score | 2,305.82 | — | Unverified |
| Atari 2600 Frostbite | Persistent AL | Score | 3,248.96 | — | Unverified |
| Atari 2600 Gopher | Advantage Learning | Score | 11,912.68 | — | Unverified |
| Atari 2600 Gopher | Persistent AL | Score | 10,611.81 | — | Unverified |