Case 1: Figure 1a shows a deterministic environment where the data is generated by a degenerate distribution. In this case, the intermediate estimate of the distribution (middle plot) contains only information about parametric uncertainty. Here, parametric uncertainty comes from the error in the estimation of the quantiles. The left sub-plot shows the estimate produced by the initialized parameters of the distribution estimator. The middle sub-plot shows that the estimated distribution converges towards the true distribution shown in the right sub-plot.

Case 2: Figure 1b shows a stochastic environment, where the data is generated by a non-degenerate (stationary) distribution. In this case, the intermediate estimated distribution reflects both parametric and intrinsic uncertainties. In the middle plot, the distribution estimator (QR) models randomness from both parametric and intrinsic uncertainties, and it is hard to separate them. The parametric uncertainty does go away over time, and the estimate converges to the true distribution shown in the right sub-plot. Our main insight in this paper is that the upper bound for a state-action value estimate shrinks at a certain rate (see Section 3 for details). Specifically, the error of the quantile estimator is known to converge asymptotically in distribution to the Normal distribution (Koenker, 2005). By treating the estimated distribution during learning as sub-normal, we can estimate an upper bound on the state-action values with high confidence (by applying Hoeffding's inequality).
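As a rough, self-contained illustration of this kind of confidence bound (our own sketch, not the exact construction used later in the paper), the snippet below turns Hoeffding's inequality into a high-confidence upper bound on a mean estimate. The function name `hoeffding_upper_bound` and the assumption that samples lie in a known interval [r_min, r_max] are ours.

```python
import numpy as np

def hoeffding_upper_bound(samples, r_min, r_max, delta=0.05):
    """High-confidence upper bound on the mean of a bounded random variable.

    By Hoeffding's inequality, for n i.i.d. samples supported on [r_min, r_max],
    the true mean exceeds  sample_mean + (r_max - r_min) * sqrt(log(1/delta) / (2n))
    with probability at most delta.
    """
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    width = (r_max - r_min) * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return samples.mean() + width

# Example: an optimistic estimate of an arm's value from 100 noisy returns in [0, 1].
rng = np.random.default_rng(0)
returns = rng.uniform(0.0, 1.0, size=100)
print(hoeffding_upper_bound(returns, r_min=0.0, r_max=1.0, delta=0.05))
```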
This example illustrates that distributions learned via distributional methods (such as distributional RL algorithms) model the randomness arising from both intrinsic and parametric uncertainties. In this paper, we study how to take advantage of the distributions learned by distributional RL methods for efficient exploration in the face of uncertainty.

To be more specific, we use the Quantile Regression Deep-Q-Network (QR-DQN; Dabney et al., 2017) to learn the distribution of the value function. We start with an examination of the two uncertainties and a naive solution that leaves the intrinsic uncertainty unsuppressed. We construct a counterexample in which this naive solution fails to learn: the intrinsic uncertainty persists and leads the naive solution to favor actions with higher variances. To suppress the intrinsic uncertainty, we apply a decaying schedule to improve the naive solution.

One interesting finding in our experiments is that the distributions learned by QR-DQN can be asymmetric. By using the upper quantiles of the estimated distribution (Mullooly, 1988), we estimate an optimistic exploration bonus for QR-DQN.

We evaluated our algorithm in 49 Atari games (Bellemare et al., 2013). Our approach achieved a 483% average gain in cumulative rewards over QR-DQN. The overall improvement is reported in Figure 10.

We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice as fast as QR-DQN.

In the rest of this paper, we first present some preliminaries of RL in Section 2. In Section 3, we study the challenges posed by the mixture of parametric and intrinsic uncertainties and propose a solution to suppress the intrinsic uncertainty. We also propose a truncated variance estimate for the exploration bonus in this section. In Section 4, we present empirical results in Atari games. Section 5 contains results on CARLA. Section 6 gives an overview of related work, and Section 7 concludes.

2. Background

2.1. Reinforcement Learning

We consider a Markov Decision Process (MDP) with a state space $S$, an action space $A$, a reward "function" $R : S \times A \rightarrow \mathbb{R}$, a transition kernel $p : S \times A \times S \rightarrow [0, 1]$, and a discount ratio $\gamma \in [0, 1)$. In this paper we treat the reward "function" $R$ as a random variable to emphasize its stochasticity. The bandit setting is a special case of the general RL setting, where we usually have only one state.

We use $\pi : S \times A \rightarrow [0, 1]$ to denote a stochastic policy. We use $Z^\pi(s, a)$ to denote the random variable of the sum of the discounted rewards in the future, following the policy $\pi$ and starting from the state $s$ and the action $a$. We have $Z^\pi(s, a) \doteq \sum_{t=0}^{\infty} \gamma^t R(S_t, A_t)$, where $S_0 = s$, $A_0 = a$, $S_{t+1} \sim p(\cdot \mid S_t, A_t)$, and $A_t \sim \pi(\cdot \mid S_t)$. The expectation of the random variable $Z^\pi(s, a)$ is
$$Q^\pi(s, a) \doteq \mathbb{E}_{\pi, p, R}\left[ Z^\pi(s, a) \right],$$
which is usually called the state-action value function.
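To make these definitions concrete, here is a small sketch (our own toy example; the two-state MDP, its dynamics, and all names are invented for illustration) that samples discounted returns, i.e., realizations of $Z^\pi(s_0, a_0)$, and averages them into a Monte Carlo estimate of $Q^\pi(s_0, a_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.9

def step(state, action):
    """Hypothetical 2-state, 2-action MDP: random next state, noisy reward."""
    reward = float(action) + rng.normal(scale=0.1)   # stochastic reward R(s, a)
    next_state = rng.integers(2)                     # transition kernel p(.|s, a)
    return next_state, reward

def policy(state):
    return rng.integers(2)                           # uniform random policy pi

def sampled_return(s0, a0, horizon=200):
    """One sample of Z^pi(s0, a0): the discounted sum of rewards along a rollout."""
    g, s, a = 0.0, s0, a0
    for t in range(horizon):                         # horizon truncates the infinite sum
        s, r = step(s, a)
        g += (GAMMA ** t) * r
        a = policy(s)
    return g

# Q^pi(s0, a0) is the expectation of Z^pi(s0, a0); estimate it by averaging samples.
q_estimate = np.mean([sampled_return(s0=0, a0=1) for _ in range(1000)])
print(q_estimate)
```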
In the general RL setting, we are usually interested in finding an optimal policy $\pi^*$ such that $Q^{\pi^*}(s, a) \geq Q^{\pi}(s, a)$ holds for any $(\pi, s, a)$. All the possible optimal policies share the same optimal state-action value function $Q^*$, which is the unique fixed point of the Bellman optimality operator $\mathcal{T}$ (Bellman, 2013),
$$Q(s, a) = \mathcal{T} Q(s, a) \doteq \mathbb{E}[R(s, a)] + \gamma \, \mathbb{E}_{s' \sim p}\big[ \max_{a'} Q(s', a') \big].$$

Based on the Bellman optimality operator, Watkins & Dayan (1992) proposed Q-learning to learn the optimal state-action value function $Q^*$ for control. At each time step, we update $Q(s, a)$ as
$$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big),$$
where $\alpha$ is a step size and $(s, a, r, s')$ is a transition. There have been many works extending Q-learning to linear function approximation (Sutton & Barto, 2018; Szepesvári, 2010).
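The tabular update is short enough to state directly in code. The following sketch (our illustration; the table sizes and the transition tuple are hypothetical) applies a single Q-learning update.

```python
import numpy as np

n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))   # tabular state-action values
alpha, gamma = 0.1, 0.99

def q_learning_update(Q, s, a, r, s_next):
    """One step of Watkins' Q-learning: move Q(s, a) toward the bootstrap target."""
    td_target = r + gamma * Q[s_next].max()      # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])     # Q(s, a) <- Q(s, a) + alpha * TD error
    return Q

# Example transition (s, a, r, s'): purely illustrative numbers.
Q = q_learning_update(Q, s=2, a=1, r=1.0, s_next=3)
print(Q[2, 1])
```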
Mnih et al. (2015) combined Q-learning with deep neural network function approximators, resulting in the Deep-Q-Network (DQN). Assume the Q function is parameterized by a network $\theta$; at each time step, DQN performs a stochastic gradient descent step to update $\theta$, minimizing the loss
$$\frac{1}{2}\big( r_{t+1} + \gamma \max_{a} Q_{\theta^-}(s_{t+1}, a) - Q_{\theta}(s_t, a_t) \big)^2,$$
where $\theta^-$ is the target network (Mnih et al., 2015), which is a copy of $\theta$ and is synchronized with $\theta$ periodically, and $(s_t, a_t, r_{t+1}, s_{t+1})$ is a transition sampled from an experience replay buffer (Mnih et al., 2015), which is a first-in-first-out queue storing previously experienced transitions. Decorrelating the representation has been shown to speed up DQN significantly (Mavrin et al., 2019a). For simplicity, in this paper we focus on the case without decorrelation.
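A minimal sketch of this loss computation for a single transition, assuming the online and target networks' outputs are already available as arrays (the function name, array shapes, and example numbers are ours; a real DQN would backpropagate this loss through the network):

```python
import numpy as np

def dqn_loss(q_online_s, q_target_s_next, action, reward, gamma=0.99):
    """Squared TD error of DQN for a single transition.

    q_online_s      : Q_theta(s_t, .)        -- online network outputs, shape (n_actions,)
    q_target_s_next : Q_theta_minus(s_{t+1}, .) -- target network outputs, shape (n_actions,)
    """
    td_target = reward + gamma * q_target_s_next.max()   # bootstrap from the target network
    td_error = td_target - q_online_s[action]
    return 0.5 * td_error ** 2

# Illustrative values for a transition (s_t, a_t = 1, r_{t+1} = 0.5, s_{t+1}).
loss = dqn_loss(np.array([0.2, 0.4, 0.1]), np.array([0.3, 0.7, 0.0]),
                action=1, reward=0.5)
print(loss)
```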
2.2. Quantile Regression

The core idea behind QR-DQN is quantile regression, introduced in the seminal paper of Koenker & Bassett Jr (1978). This approach gained significant attention in the statistics and econometrics literature. Given samples $\{(x_i, y_i)\}$, the quantile regression estimator for the quantile level $\tau$ minimizes the loss $L(\beta) = \sum_i \rho_\tau(y_i - x_i \beta)$, where
$$\rho_\tau(u) = u\big(\tau - \mathbb{I}_{u < 0}\big) = \tau |u| \, \mathbb{I}_{u \geq 0} + (1 - \tau) |u| \, \mathbb{I}_{u < 0} \quad (2)$$
is the weighted sum of residuals. The weights are proportional to the counts of the residual signs and to the order of the estimated quantile $\tau$: for higher quantiles, positive residuals get higher weight, and vice versa. If $\tau = \frac{1}{2}$, then the estimate of the median of $y_i$ given $x_i$ is $\theta_{1/2}(y_i \mid x_i) = x_i \hat{\beta}$, with $\hat{\beta} = \arg\min_\beta L(\beta)$.
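To make the asymmetry of the weighting concrete, the sketch below implements $\rho_\tau$ and recovers the 0.9-quantile of a sample by minimizing the empirical quantile loss over a grid of candidate values (a toy location-only version of the regression problem above; all names and the grid are ours):

```python
import numpy as np

def rho(u, tau):
    """Quantile (check) loss rho_tau(u): positive and negative residuals are
    weighted by tau and (1 - tau) respectively."""
    return np.where(u >= 0, tau * np.abs(u), (1.0 - tau) * np.abs(u))

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)

# Estimate the 0.9-quantile of y by minimizing the empirical quantile loss over a grid.
candidates = np.linspace(-4, 4, 801)
losses = [rho(y - c, tau=0.9).mean() for c in candidates]
theta_hat = candidates[int(np.argmin(losses))]
print(theta_hat, np.quantile(y, 0.9))   # the two estimates should be close
```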
2.3. Distributional RL

Instead of learning the expected return $Q$, distributional RL focuses on learning the full distribution of the random variable $Z$ directly (Jaquette, 1973; Bellemare et al., 2017; Mavrin et al., 2019b). There are various approaches to represent a distribution in the RL setting (Bellemare et al., 2017; Dabney et al., 2018; Barth-Maron et al., 2018). In this paper, we focus on the quantile representation (Dabney et al., 2017) used in QR-DQN, where the distribution of $Z$ is represented by a uniform mix of $N$ supporting quantiles:
$$Z_\theta(s, a) \doteq \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s, a)},$$
where $\delta_x$ denotes a Dirac at $x \in \mathbb{R}$, and each $\theta_i$ is an estimate of the quantile corresponding to the quantile level (a.k.a. quantile index) $\hat{\tau}_i \doteq \frac{\tau_{i-1} + \tau_i}{2}$ with $\tau_i \doteq \frac{i}{N}$ for $0 \leq i \leq N$. The state-action value $Q(s, a)$ is then approximated by $\frac{1}{N}\sum_{i=1}^{N} \theta_i(s, a)$. Such an approximation of a distribution is referred to as quantile approximation.
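A small sketch of this representation (our own illustration with hypothetical quantile estimates): given $N$ sorted atoms $\theta_i(s, a)$, the quantile midpoints $\hat{\tau}_i$ and the implied value estimate are

```python
import numpy as np

N = 8
# Hypothetical quantile estimates theta_i(s, a) for one state-action pair, sorted.
theta = np.array([-1.0, -0.3, 0.1, 0.4, 0.8, 1.5, 2.4, 4.0])

tau = np.arange(N + 1) / N                 # tau_i = i / N,  i = 0..N
tau_hat = (tau[:-1] + tau[1:]) / 2.0       # quantile midpoints tau_hat_i
q_value = theta.mean()                     # Q(s, a) ~ (1/N) * sum_i theta_i(s, a)

print(tau_hat)
print(q_value)
```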
Similar to the Bellman optimality operator in mean-centered RL, we have the distributional Bellman optimality operator for control in distributional RL,
$$\mathcal{T} Z(s, a) \doteq R(s, a) + \gamma Z\big(s', \arg\max_{a'} \mathbb{E}_{p, R}[Z(s', a')]\big), \quad s' \sim p(\cdot \mid s, a).$$

Based on the distributional Bellman optimality operator, Dabney et al. (2017) proposed to train the quantile estimates (i.e., the $\{\theta_i\}$) via the Huber quantile regression loss (Huber, 1964). To be more specific, at time step $t$ the loss is
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{i'=1}^{N} \Big[ \rho^{\kappa}_{\hat{\tau}_i} \big( y_{t, i'} - \theta_i(s_t, a_t) \big) \Big],$$
where $y_{t,i'}$ is the $i'$-th quantile of the distributional Bellman target, i.e., $y_{t,i'} = r_{t+1} + \gamma \, \theta_{i'}(s_{t+1}, a^*)$ with $a^*$ the greedy action under the estimated mean, and $\rho^{\kappa}_{\hat{\tau}_i}$ is the Huber-smoothed version of the quantile loss in (2) with threshold $\kappa$.
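The sketch below computes this loss for one transition from $N$ predicted quantiles and $N$ target quantiles (our own illustration; it uses the common Huber-smoothed check function with threshold $\kappa$ and omits any normalization by $\kappa$, so constants may differ from a particular implementation):

```python
import numpy as np

def huber(u, kappa=1.0):
    """Huber penalty: quadratic near zero, linear in the tails."""
    abs_u = np.abs(u)
    return np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))

def quantile_huber_loss(theta, targets, kappa=1.0):
    """Quantile regression loss between N predicted quantiles theta_i(s_t, a_t)
    and N target quantiles y_{t,i'}, averaged as in the display above."""
    N = len(theta)
    tau_hat = (np.arange(N) + 0.5) / N                 # quantile midpoints tau_hat_i
    # Pairwise TD errors u[i, i'] = y_{t,i'} - theta_i, shape (N, N).
    u = targets[None, :] - theta[:, None]
    check_weight = np.abs(tau_hat[:, None] - (u < 0).astype(float))
    return (check_weight * huber(u, kappa)).sum(axis=1).mean()

# Hypothetical quantiles and distributional Bellman targets for one transition.
theta = np.array([-0.5, 0.0, 0.5, 1.0])
targets = 0.1 + 0.99 * np.array([-0.2, 0.3, 0.9, 1.4])   # r + gamma * theta_j(s', a*)
print(quantile_huber_loss(theta, targets))
```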
3. Algorithm

In this section we present our method. First, we study the issue of the mixture of parametric and intrinsic uncertainties in the estimated distributions learned by the QR approach. We show that the intrinsic uncertainty has to be suppressed when calculating the exploration bonus, and we introduce a decaying schedule to achieve this.

Second, in a simple example where the distribution is asymmetric, we show that an exploration bonus from the truncated variance outperforms a bonus from the variance. In fact, we did find that the distributions learned by QR-DQN (in Atari games) can be asymmetric. Thus we incorporate the truncated variance for exploration in our method.

3.1. The issue of intrinsic uncertainty

A naive approach to exploration would be to use the variance of the estimated distribution as a bonus. We provide an illustrative counterexample. Consider a multi-armed bandit environment with 10 arms where each arm's reward follows a normal distribution $N(\mu_k, \sigma_k)$. In each run, the means $\{\mu_k\}_k$ are drawn from a standard normal. The standard deviation of the best arm is set to 1.0, and the other arms' standard deviations are
(a) Environment with symmetric distributions. (b) Environment with asymmetric distributions.

$$\sigma^2_{+} \doteq \frac{1}{2N} \sum_{i=N/2}^{N} \big( \theta_{N/2} - \theta_i \big)^2 \quad (7)$$

Figure 6. Median human-normalized performance across 49 games (DLTV vs. QR-DQN-1).
Figure 7. Online training curves for DLTV (with decaying schedule and with constant schedule) on the game of Venture.
Figure 8. The naive exploration bonus and decaying bonus used for DLTV in Pong.

QR update (algorithm excerpt):
3: $\mathcal{T}\theta_j = r + \gamma \, \theta_j(x', a^*; w^-)$
4: $L(w) = \sum_i \frac{1}{N} \sum_j \big[ \rho_{\hat{\tau}_i}\big( \mathcal{T}\theta_j - \theta_i(x, a; w) \big) \big]$
5: $w' = \arg\min_w L(w)$
Ensure: $w'$ {updated weights of $\theta$}
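As a sketch of how the left-truncated variance of Eq. (7) can be computed from a set of learned quantile estimates (our own illustration; it assumes $N$ is even and maps the paper's 1-based index $N/2$ to Python's 0-based indexing):

```python
import numpy as np

def left_truncated_variance(theta):
    """Left truncated variance sigma_+^2 of Eq. (7): squared deviations of the
    upper quantiles from the median quantile theta_{N/2}, scaled by 1/(2N)."""
    theta = np.sort(np.asarray(theta, dtype=float))
    N = len(theta)
    median = theta[N // 2 - 1]       # theta_{N/2} under the paper's 1-based indexing
    upper = theta[N // 2 - 1:]       # quantiles i = N/2, ..., N
    return np.sum((median - upper) ** 2) / (2.0 * N)

# Example with an asymmetric (right-skewed) set of quantile estimates.
theta = [0.0, 0.1, 0.2, 0.3, 0.5, 0.9, 1.8, 3.5]
print(left_truncated_variance(theta))
```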
References

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(3):397–422, 2002.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. arXiv:1804.08617, 2018.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. NIPS, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv:1707.06887, 2017.

Bellman, R. Dynamic Programming. Courier Corporation, 2013.

Chen, C., Qian, J., Yao, H., Luo, J., Zhang, H., and Liu, W. Towards comprehensive maneuver decisions for lane change using reinforcement learning. NIPS Workshop on Machine Learning for Intelligent Transportation Systems (MLITS), 2018.

Chen, R. Y., Sidor, S., Abbeel, P., and Schulman, J. UCB exploration via Q-ensembles. arXiv:1706.01502, 2017.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. arXiv:1710.10044, 2017.

Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. arXiv:1806.06923, 2018.

Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. CARLA: An open urban driving simulator. arXiv:1711.03938, 2017.

Fridman, L., Jenik, B., and Terwilliger, J. DeepTraffic: Driving fast through dense traffic with deep reinforcement learning. arXiv:1801.02805, 2018.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. Robust Statistics: The Approach Based on Influence Functions, volume 196. John Wiley & Sons, 2011.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. arXiv:1710.02298, 2017.

Huber, P. J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

Huber, P. J. Robust statistics. International Encyclopedia of Statistical Science, 35(1):1248–1251, 2011.

Jaquette, S. C. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1(3):496–505, 1973.

Kaufmann, E., Cappé, O., and Garivier, A. On Bayesian upper confidence bounds for bandit problems. AISTATS, 2012.

Koenker, R. Quantile Regression. Econometric Society Monographs. Cambridge University Press, 2005.

Koenker, R. and Bassett Jr, G. Regression quantiles. Econometrica: Journal of the Econometric Society, 46(1):33–50, 1978.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv:1709.06009, 2017.

Mavrin, B., Yao, H., and Kong, L. Deep reinforcement learning with decorrelation. arXiv:1903.07765, 2019a.

Mavrin, B., Zhang, S., Yao, H., and Kong, L. Exploration in the face of parametric and intrinsic uncertainties. AAMAS, 2019b.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Parametric return density estimation for reinforcement learning. arXiv:1203.3497, 2012.

Mullooly, J. P. The variance of left-truncated continuous nonnegative distributions. The American Statistician, 42(3):208–210, 1988.

O'Donoghue, B., Osband, I., Munos, R., and Mnih, V. The uncertainty Bellman equation and exploration. arXiv:1709.05380, 2017.

Zhang, S., Mavrin, B., Yao, H., Kong, L., and Liu, B. QUOTA: The quantile option architecture for reinforcement learning. AAAI, 2019.
Figure 10. Cumulative rewards performance comparison of DLTV and QR-DQN-1. The bars represent the relative gain/loss of DLTV over QR-DQN-1 (categories: DLTV gain, ties, DLTV loss).