
Distributional Reinforcement Learning for Efficient Exploration

Borislav Mavrin 1 2 Hengshuai Yao 3 Linglong Kong 1 2 Kaiwen Wu 4 Yaoliang Yu 4

Abstract

In distributional reinforcement learning (RL), the estimated distribution of the value function models both the parametric and intrinsic uncertainties. We propose a novel and efficient exploration method for deep RL that has two components. The first is a decaying schedule to suppress the intrinsic uncertainty. The second is an exploration bonus calculated from the upper quantiles of the learned distribution. In Atari 2600 games, our method outperforms QR-DQN in 12 out of 14 hard games (achieving a 483% average gain across 49 games in cumulative rewards over QR-DQN, with a big win in Venture). We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice as fast as QR-DQN.

1. Introduction

Exploration is a long-standing problem in Reinforcement Learning (RL), where optimism in the face of uncertainty is one fundamental principle (Lai & Robbins, 1985; Strehl & Littman, 2005). Here the uncertainty refers to parametric uncertainty, which arises from the variance in the estimates of certain parameters given finite samples. Both count-based methods (Auer, 2002; Kaufmann et al., 2012; Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017) and Bayesian methods (Kaufmann et al., 2012; Chen et al., 2017; O'Donoghue et al., 2017) follow this optimism principle. In this paper, we propose to use distributional RL methods to achieve this optimism.

Different from classical RL methods, where an expectation of the value function is learned (Sutton, 1988; Watkins & Dayan, 1992; Mnih et al., 2015), distributional RL methods (Jaquette, 1973; Bellemare et al., 2017) maintain a full distribution of the future return. In the limit, distributional RL captures the intrinsic uncertainty of an MDP (Bellemare et al., 2017; Dabney et al., 2017; 2018; Rowland et al., 2018). Intrinsic uncertainty arises from the stochasticity of the environment, which is parameter and sample independent. However, it is not trivial to quantify the effects of parametric and intrinsic uncertainties in distribution learning. To investigate this, let us look closer at a simple setup of distribution learning. Here we use Quantile Regression (QR) (detailed in Section 2.2), but the example presented here holds for other distribution learning methods. The random samples are drawn from a stationary distribution. The initial estimated distribution is set to be the uniform one (left plots in Figure 1). At each time step, QR updates its estimate in an online fashion by minimizing a loss function. In the limit, the estimated QR distribution converges to the true distribution (right plots). The two middle plots examine the intermediate estimated distributions before convergence in two distinct cases.

Figure 1. Uncertainties in deterministic and stochastic environments. (a) Intrinsic uncertainty: in a deterministic environment, QR updates move the initial distribution towards the true distribution, and the intermediate estimate reflects parametric uncertainty only. (b) Intrinsic and parametric uncertainties: in a stochastic environment, the intermediate estimate mixes both uncertainties.

1 University of Alberta, 2 Huawei Noah's Ark, 3 Huawei Hi-Silicon, 4 University of Waterloo. Correspondence to: Hengshuai Yao <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
Case 1: Figure 1a shows a deterministic environment where the data is generated by a degenerate distribution. In this case, the intermediate estimate of the distribution (middle plot) contains only the information about parametric uncertainty. Here, parametric uncertainty comes from the error in the estimation of the quantiles. The left sub-plot shows the estimate produced by the initialized parameters of the distribution estimator. The middle sub-plot shows the estimated distribution converging towards the true distribution shown in the right sub-plot.

Case 2: Figure 1b shows a stochastic environment, where the data is generated by a non-degenerate (stationary) distribution. In this case, the intermediate estimated distribution is the result of both parametric and intrinsic uncertainties. In the middle plot, the distribution estimator (QR) models randomness from both parametric and intrinsic uncertainties, and it is hard to separate them. The parametric uncertainty does go away over time, and the estimate converges to the true distribution shown in the right sub-plot. Our main insight in this paper is that the upper bound for a state-action value estimate shrinks at a certain rate (see Section 3 for details). Specifically, the error of the quantile estimator is known to converge asymptotically in distribution to the Normal distribution (Koenker, 2005). By treating the estimated distribution during learning as sub-normal, we can estimate the upper bound of the state-action values with high confidence (by applying Hoeffding's inequality).
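The paper does not spell out the exact bound it uses here, so the sketch below is only an illustration of the general idea: a Hoeffding-style upper confidence bound on a bounded return, whose exploration radius shrinks as more samples arrive. The function name, the confidence level delta, and the [0, 1] return range are assumptions of the example.

```python
import numpy as np

def hoeffding_ucb(samples, value_range, delta=0.05):
    """Upper confidence bound on the true mean of bounded i.i.d. samples.

    By Hoeffding's inequality, for n samples bounded in an interval of width
    `value_range`, the true mean exceeds
        sample_mean + value_range * sqrt(log(1/delta) / (2 * n))
    with probability at most `delta`.
    """
    n = len(samples)
    radius = value_range * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return np.mean(samples) + radius

# Toy usage: the exploration radius shrinks as more returns are observed.
rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    returns = rng.uniform(0.0, 1.0, size=n)   # returns assumed bounded in [0, 1]
    print(n, hoeffding_ucb(returns, value_range=1.0))
```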
We use π : S × A → [0, 1] to denote a stochastic policy.
This example illustrates distributions learned via distribu- We use Z π (s, a) to denote the random variable of the sum
tional methods (such as distributional RL algorithms) model of the discounted rewards in the future, following the policy
the randomness arising from both intrinsic and parametric π and starting from the state s and the action a. We have
. P∞
uncertainties. In this paper, we study how to take advantage Z π (s, a) = t=0 γ t R(St , At ), where S0 = s, A0 = a and
of distributions learned by distributional RL methods for St+1 ∼ p(·|St , At ), At ∼ π(·|St ). The expectation of the
efficient exploration in the face of uncertainty. random variable Z π (s, a) is
.
To be more specific, we use Quantile Regression Deep-Q- Qπ (s, a) = Eπ,p,R [Z π (s, a)]
Network (QR-DQN, (Dabney et al., 2017)) to learn the
which is usually called the state-action value function. In
distribution of value function. We start with an examination
general RL setting, we are usually interested in finding an
of the two uncertainties and a naive solution that leaves the ∗
optimal policy π ∗ , such that Qπ (s, a) ≥ Qπ (s, a) holds
intrinsic uncertainty unsupressed. We construct a counter
for any (π, s, a). All the possible optimal policies share
example in which this naive solution fails to learn. The
the same optimal state-action value function Q∗ , which is
intrinsic uncertainty persists and leads the naive solution to
the unique fixed point of the Bellman optimality operator
favor actions with higher variances. To suppress the intrinsic
(Bellman, 2013),
uncertainty, we apply a decaying schedule to improve the
.
naive solution. Q(s, a) = T Q(s, a) = E[R(s, a)] + γEs0 ∼p [max 0
Q(s0 , a0 )]
a
One interesting finding in our experiments is that the distri- Based on the Bellman optimality operator, Watkins & Dayan
butions learned by QR-DQN can be asymmetric. By using (1992) proposed Q-learning to learn the optimal state-action
the upper quantiles of the estimated distribution (Mullooly, value function Q∗ for control. At each time step, we update
1988), we estimate an optimistic exploration bonus for QR- Q(s, a) as
DQN.
Q(s, a) ← Q(s, a) + α(r + γ max
0
Q(s0 , a0 ) − Q(s, a))
a
We evaluated our algorithm in 49 Atari games (Bellemare
et al., 2013). Our approach achieved 483 % average gain where α is a step size and (s, a, r, s0 ) is a transition. There
in cumulative rewards over QR-DQN. The overall improve- have been many work extending Q-learning to linear func-
tion approximation (Sutton & Barto, 2018; Szepesvári,
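For reference, a minimal tabular implementation of the update above might look as follows; the environment interface (reset/step) and the epsilon-greedy behaviour policy are assumptions of this sketch, not something prescribed by the paper.

```python
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning: Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy (assumed; any exploratory policy works)
            a = rng.integers(num_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)           # assumed env interface
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])   # the Q-learning update
            s = s_next
    return Q
```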
There have been many works extending Q-learning to linear function approximation (Sutton & Barto, 2018; Szepesvári, 2010). Mnih et al. (2015) combined Q-learning with deep neural network function approximators, resulting in the Deep-Q-Network (DQN). Assume the Q function is parameterized by a network θ; at each time step, DQN performs a stochastic gradient descent step to update θ, minimizing the loss

(1/2) (r_{t+1} + γ max_a Q_{θ⁻}(s_{t+1}, a) − Q_θ(s_t, a_t))²,

where θ⁻ is the target network (Mnih et al., 2015), which is a copy of θ and is synchronized with θ periodically, and (s_t, a_t, r_{t+1}, s_{t+1}) is a transition sampled from an experience replay buffer (Mnih et al., 2015), which is a first-in-first-out queue storing previously experienced transitions. Decorrelating the representation has been shown to speed up DQN significantly (Mavrin et al., 2019a). For simplicity, in this paper we will focus on the case without decorrelation.

2.2. Quantile Regression

The core idea behind QR-DQN is Quantile Regression, introduced by the seminal paper (Koenker & Bassett Jr, 1978). This approach gained significant attention in the field of theoretical and applied statistics and might not be well known in other fields. For that reason we give a brief introduction here. Let us first consider QR in supervised learning. Given data {(x_i, y_i)}_i, we want to compute the quantile of y corresponding to the quantile level τ. The linear quantile regression loss is defined as

L(β) = Σ_i ρ_τ(y_i − x_i β),   (1)

where

ρ_τ(u) = u (τ − I_{u<0}) = τ |u| I_{u≥0} + (1 − τ) |u| I_{u<0}   (2)

is the weighted sum of residuals. Weights are proportional to the counts of the residual signs and the order of the estimated quantile τ. For higher quantiles, positive residuals get a higher weight, and vice versa. If τ = 1/2, then the estimate of the conditional median of y_i given x_i is θ_{1/2}(y_i|x_i) = x_i β̂, with β̂ = arg min_β L(β).
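As a minimal illustration of (1) and (2), the sketch below estimates a single quantile of a sample by subgradient descent on the pinball loss; the constant-only model (no covariates x_i), the step size and the iteration count are assumptions chosen to keep the example short.

```python
import numpy as np

def pinball_loss(u, tau):
    """rho_tau(u) = u * (tau - 1{u < 0}), cf. Eq. (2)."""
    return u * (tau - (u < 0).astype(float))

def fit_quantile(y, tau, lr=0.05, steps=2000):
    """Estimate the tau-quantile of y by subgradient descent on Eq. (1), constant model."""
    theta = 0.0
    for _ in range(steps):
        u = y - theta
        # subgradient of the mean pinball loss with respect to theta
        theta += lr * np.mean(tau - (u < 0).astype(float))
    return theta

rng = np.random.default_rng(0)
y = rng.lognormal(0.0, 1.0, size=10000)              # an asymmetric sample
for tau in (0.5, 0.9):
    theta = fit_quantile(y, tau)
    print(tau, theta, np.quantile(y, tau), pinball_loss(y - theta, tau).mean())
```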
2.3. Distributional RL

Instead of learning the expected return Q, distributional RL focuses on learning the full distribution of the random variable Z directly (Jaquette, 1973; Bellemare et al., 2017; Mavrin et al., 2019b). There are various approaches to represent a distribution in the RL setting (Bellemare et al., 2017; Dabney et al., 2018; Barth-Maron et al., 2018). In this paper, we focus on the quantile representation (Dabney et al., 2017) used in QR-DQN, where the distribution of Z is represented by a uniform mix of N supporting quantiles:

Z_θ(s, a) := (1/N) Σ_{i=1}^N δ_{θ_i(s,a)},

where δ_x denotes a Dirac at x ∈ R, and each θ_i is an estimate of the quantile corresponding to the quantile level (a.k.a. quantile index) τ̂_i := (τ_{i−1} + τ_i)/2 with τ_i := i/N for 0 ≤ i ≤ N. The state-action value Q(s, a) is then approximated by (1/N) Σ_{i=1}^N θ_i(s, a). Such an approximation of a distribution is referred to as quantile approximation.

Similar to the Bellman optimality operator in mean-centered RL, we have the distributional Bellman optimality operator for control in distributional RL,

T Z(s, a) := R(s, a) + γ Z(s', arg max_{a'} E_{p,R}[Z(s', a')]),   s' ~ p(·|s, a).

Based on the distributional Bellman optimality operator, Dabney et al. (2017) proposed to train the quantile estimates {θ_i} via the Huber quantile regression loss (Huber, 1964). To be more specific, at time step t the loss is

(1/N) Σ_{i=1}^N Σ_{i'=1}^N [ρ^κ_{τ̂_i}(y_{t,i'} − θ_i(s_t, a_t))],

where y_{t,i'} := r_t + γ θ_{i'}(s_{t+1}, arg max_{a'} Σ_{i=1}^N θ_i(s_{t+1}, a')) and ρ^κ_{τ̂_i}(x) := |τ̂_i − I{x < 0}| L_κ(x), with I the indicator function and L_κ the Huber loss,

L_κ(x) = (1/2) x²  if |x| ≤ κ,   L_κ(x) = κ (|x| − κ/2)  otherwise.
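A compact NumPy sketch of this loss is given below; it takes the quantile estimates θ_i(s_t, a_t) and the Bellman targets y_{t,i'} as plain arrays, and the vectorized pairwise form is an implementation choice of the example rather than anything mandated by the paper.

```python
import numpy as np

def huber(x, kappa=1.0):
    """L_kappa(x): quadratic for |x| <= kappa, linear beyond."""
    return np.where(np.abs(x) <= kappa,
                    0.5 * x ** 2,
                    kappa * (np.abs(x) - 0.5 * kappa))

def quantile_huber_loss(theta, targets, kappa=1.0):
    """Huber quantile regression loss between N quantile estimates and N targets.

    theta:   shape (N,), quantile estimates theta_i(s_t, a_t)
    targets: shape (N,), Bellman targets y_{t, i'}
    """
    N = theta.shape[0]
    tau_hat = (np.arange(N) + 0.5) / N                   # quantile midpoints tau_hat_i
    u = targets[None, :] - theta[:, None]                # pairwise errors, rows i, columns i'
    weight = np.abs(tau_hat[:, None] - (u < 0).astype(float))
    # average over targets i', sum over quantile levels i: (1/N) * sum_i sum_i' rho
    return np.sum(np.mean(weight * huber(u, kappa), axis=1))

rng = np.random.default_rng(0)
theta = np.sort(rng.normal(size=32))
targets = np.sort(rng.normal(loc=0.5, size=32))
print(quantile_huber_loss(theta, targets))
```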
3. Algorithm

In this section we present our method. First, we study the issue of the mixture of parametric and intrinsic uncertainties in the estimated distributions learned by the QR approach. We show that the intrinsic uncertainty has to be suppressed in calculating the exploration bonus, and introduce a decaying schedule to achieve this.

Second, in a simple example where the distribution is asymmetric, we show that an exploration bonus from the truncated variance outperforms a bonus from the variance. In fact, we did find that the distributions learned by QR-DQN (in Atari games) can be asymmetric. Thus we combine the truncated variance for exploration in our method.

3.1. The issue of intrinsic uncertainty

A naive approach to exploration would be to use the variance of the estimated distribution as a bonus. We provide an illustrative counter example. Consider a multi-armed bandit environment with 10 arms where each arm's reward follows a normal distribution N(μ_k, σ_k). In each run, the means {μ_k}_k are drawn from a standard normal distribution. The standard deviation of the best arm is set to 1.0; the other arms' standard deviations are set to 5.
Figure 2. Exploration in the face of intrinsic and parametric uncertainties. (a) Naive exploration bonus. (b) Decaying exploration bonus. In both panels the bonus is decomposed over time steps into its intrinsic and parametric components.

Figure 3. Performance of naive exploration and decaying exploration bonus in the counter example.

In the setting of multi-armed bandits, this approach leads to picking the arm a such that

a = arg max_k  μ̄_k + c σ_k,   (3)

where μ̄_k and σ_k² are the estimated mean and variance of the k-th arm, computed from the corresponding quantile distribution estimate.

Figure 3 shows that the naive exploration bonus fails. Figure 2a illustrates the reason for the failure of the naive exploration bonus. The estimated QR distribution is a mixture of parametric and intrinsic uncertainties. Recall that, as learning progresses, the parametric uncertainty vanishes while the intrinsic uncertainty stays (Figure 2b). Therefore, this naive exploration bonus tends to be biased towards intrinsic variation, which hurts performance. Note that the best arm has a low intrinsic variation. It is not chosen, since its exploration bonus term is much smaller than that of the other arms once the parametric uncertainty vanishes in all arms.

The major obstacle in using the distribution estimated by QR for exploration is the composition of parametric and intrinsic uncertainties, whose variance is measured by the term σ_k² in (3). To suppress the intrinsic uncertainty, we propose a decaying schedule in the form of a multiplier to σ_k²:

a = arg max_k  μ̄_k + c_t σ̄_k.   (4)

Figure 2b depicts the exploration bonus resulting from the application of the decaying schedule. From classical QR theory (Koenker, 2005), it is known that the parametric uncertainty decays at the following rate:

c_t = c √(log t / t),   (5)

where c is a constant factor.

We apply this new schedule to the counter example where the naive solution fails. As shown in Figure 3, this decaying schedule significantly outperforms the naive exploration bonus.
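The counter example and the two bonuses in (3)-(5) can be reproduced with a short simulation along the following lines; here the arm means and standard deviations are estimated from the sample history rather than from learned quantiles, and the horizon, the constant c and the seed are assumptions of the sketch.

```python
import numpy as np

def run_bandit(decaying, T=20000, k=10, c=3.0, seed=0):
    """10-armed Gaussian bandit: the best arm has std 1.0, the others std 5.0.

    Picks arm argmax_k mean_k + c_t * std_k, with c_t = c (naive bonus) or
    c_t = c * sqrt(log t / t) (decaying schedule, Eq. (5)). Returns total reward.
    """
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=k)
    sigma = np.full(k, 5.0)
    sigma[np.argmax(mu)] = 1.0                 # the best arm has low intrinsic variation
    rewards = [[] for _ in range(k)]
    for arm in range(k):                       # pull each arm once to initialize
        rewards[arm].append(rng.normal(mu[arm], sigma[arm]))
    total = 0.0
    for t in range(k + 1, T + 1):
        c_t = c * np.sqrt(np.log(t) / t) if decaying else c
        means = np.array([np.mean(r) for r in rewards])
        stds = np.array([np.std(r) if len(r) > 1 else 5.0 for r in rewards])
        arm = int(np.argmax(means + c_t * stds))
        r = rng.normal(mu[arm], sigma[arm])
        rewards[arm].append(r)
        total += r
    return total

print("naive bonus   :", run_bandit(decaying=False))
print("decaying bonus:", run_bandit(decaying=True))
```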
3.2. Asymmetry and truncated variance

QR places no restriction on the family of distributions it can represent. In fact, the learned distribution can be asymmetric, i.e., its mean and median can differ. From Figure 5 it can be seen that the distribution estimated by QR-DQN-1 is mostly asymmetric. At the end of training, the agent achieved a nearly maximum score. Hence, the distributions correspond to a near-optimal policy, yet they are not symmetric.

For the sake of the argument, consider a simple decomposition of the variance of the QR-estimated distribution into two terms: the Right Truncated and Left Truncated variances¹:

σ² = (1/N) Σ_{i=1}^{N} (θ̄ − θ_i)²
   = (1/N) Σ_{i=1}^{N/2} (θ̄ − θ_i)² + (1/N) Σ_{i=N/2+1}^{N} (θ̄ − θ_i)²
   = σ²_rt + σ²_lt,

where σ²_rt is the Right Truncated Variance and σ²_lt is the Left Truncated Variance. To simplify notation we assume N is an even number here. The Right Truncated Variance tells about lower tail variability and the Left Truncated Variance tells about upper tail variability. In general, the two variances are not equal.² If the distribution is symmetric, then the two are the same.

¹ Note: right truncation means dropping the right part of the distribution with respect to the mean.
² Consider a discrete empirical distribution with support {−1, 0, 2} and probability atoms {1/3, 1/3, 1/3}.
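A small numerical sketch of this decomposition follows; the helper splits the variance of N sorted quantile estimates at the midpoint, and the Normal and LogNormal quantiles used to contrast the symmetric and asymmetric cases are assumptions of the example.

```python
import numpy as np

def truncated_variances(theta):
    """Split the variance of N (even) quantile estimates into right- and left-truncated parts."""
    theta = np.sort(np.asarray(theta, dtype=float))
    N = theta.size
    sq = (theta.mean() - theta) ** 2
    var_rt = sq[: N // 2].sum() / N      # lower-half quantiles: lower-tail variability
    var_lt = sq[N // 2 :].sum() / N      # upper-half quantiles: upper-tail variability
    return var_rt, var_lt

levels = (np.arange(50) + 0.5) / 50      # quantile midpoints tau_hat_i
rng = np.random.default_rng(0)
sym = np.quantile(rng.normal(size=100000), levels)
skew = np.quantile(rng.lognormal(size=100000), levels)
print(truncated_variances(sym))    # roughly equal halves for a symmetric distribution
print(truncated_variances(skew))   # the left-truncated (upper-tail) part dominates
```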
Figure 4. Environments with symmetric and asymmetric reward distributions. (a) Environment with symmetric distributions. (b) Environment with asymmetric distributions. Each panel compares the truncated variance bonus with the plain variance bonus.

Figure 5. Pong. Empirical measure of the distribution learned for a single action, obtained from QR-DQN-1 during training; the distribution is clearly asymmetric.

The Truncated Variance is equivalent to the Tail Conditional Variance (TCV)

TCV_x(θ) = Var(θ − θ̄ | θ > x)   (6)

defined in (Valdez, 2005). For instantiating optimism in the face of uncertainty, the upper tail variability is more relevant than the lower tail, especially if the estimated distribution is asymmetric (Valdez, 2005). Intuitively speaking, σ²_lt is more optimistic: σ²_lt is biased towards positive rewards. To increase stability, we use a left truncated measure of the variability, σ²_+, based on the median rather than the mean, due to the median's well-known statistical robustness (Huber, 2011; Hampel et al., 2011):

σ²_+ = (1/(2N)) Σ_{i=N/2}^{N} (θ_{N/2} − θ_i)²,   (7)

where the θ_i's are the i/N-th quantiles. By combining the decaying schedule from (5) with σ²_+ from (7) we obtain a new exploration bonus for picking an action, which we call the Decaying Left Truncated Variance (DLTV).
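A direct transcription of (7) as a function of sorted quantile estimates might look as follows; the 0-based indexing convention for the median quantile θ_{N/2} is an assumption of this sketch.

```python
import numpy as np

def left_truncated_variance(theta):
    """sigma_plus^2 from Eq. (7): upper-tail spread measured around the median quantile.

    theta: array of shape (N,), quantile estimates theta_1 <= ... <= theta_N, N even.
    """
    theta = np.sort(np.asarray(theta, dtype=float))
    N = theta.size
    median_q = theta[N // 2 - 1]           # theta_{N/2} with 0-based indexing
    upper = theta[N // 2 - 1 :]            # quantiles i = N/2, ..., N
    return np.sum((median_q - upper) ** 2) / (2.0 * N)

# Example: a right-skewed set of quantiles yields a large sigma_plus^2.
levels = (np.arange(20) + 0.5) / 20
skewed = np.quantile(np.random.default_rng(0).lognormal(size=50000), levels)
print(left_truncated_variance(skewed))
```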
In order to empirically validate our new approach, we employ a multi-armed bandit environment with asymmetrically distributed rewards. In each run, the means of the arms {μ_k}_k are drawn from a standard normal distribution. The best arm's reward follows μ_k + E[LogNormal(0, 1)] − LogNormal(0, 1). The other arms' rewards follow μ_k + LogNormal(0, 1) − E[LogNormal(0, 1)]. We also compare the performance of both exploration methods in another, symmetric environment with rewards following normal distributions centered at the corresponding means (same as in the asymmetric environment) with unit variance.

The results are presented in Figure 4. With asymmetric reward distributions, the truncated variance exploration bonus significantly outperforms the naive variance exploration bonus. In addition, the performance of the truncated variance is slightly better in the symmetric case.
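The reward construction just described can be written down directly, as in the following sketch; the centering constant uses E[LogNormal(0, 1)] = exp(1/2), and the generator seed and sample sizes are assumptions of the example.

```python
import numpy as np

LOGNORMAL_MEAN = np.exp(0.5)   # E[LogNormal(0, 1)]

def sample_reward(mu_k, is_best_arm, rng):
    """Asymmetric bandit rewards from Section 3.2.

    Best arm:  mu_k + E[LogNormal(0,1)] - LogNormal(0,1)   (left-skewed, mean mu_k)
    Others:    mu_k + LogNormal(0,1) - E[LogNormal(0,1)]   (right-skewed, mean mu_k)
    """
    z = rng.lognormal(0.0, 1.0)
    return mu_k + (LOGNORMAL_MEAN - z if is_best_arm else z - LOGNORMAL_MEAN)

rng = np.random.default_rng(0)
best = [sample_reward(0.0, True, rng) for _ in range(100000)]
other = [sample_reward(0.0, False, rng) for _ in range(100000)]
print(np.mean(best), np.mean(other))   # both means are approximately mu_k = 0
```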
3.3. DLTV for Deep RL

So far, we have introduced the decaying schedule to control the parametric part of the composite uncertainty. Additionally, we have introduced a truncated variance to improve performance in environments with asymmetric distributions.
Figure 6. Median human-normalized performance across 49 games for DLTV and QR-DQN-1.

These ideas generalize in a straightforward fashion to the Deep RL setting. Algorithm 1 outlines DLTV for Deep RL. The action selection step in line 2 of Algorithm 1 uses the exploration bonus in the form of σ²_+ defined in (7) and the schedule c_t defined in (5).

Algorithm 1 DLTV for Deep RL
Require: w, w⁻, (x, a, r, x'), γ ∈ [0, 1) {network weights, sampled transition, discount factor}
1: Q(x', a') = Σ_j q_j θ_j(x', a'; w⁻)
2: a* = arg max_{a'} (Q(x, a') + c_t √(σ²_+))
3: T θ_j = r + γ θ_j(x', a*; w⁻)
4: L(w) = Σ_i (1/N) Σ_j [ρ_{τ̂_i}(T θ_j − θ_i(x, a; w))]
5: w' = arg min_w L(w)
Ensure: w' {updated weights of θ}
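Line 2 of Algorithm 1 is the only place where exploration enters, and a sketch of that step on top of the quantile outputs of a QR network is shown below; the network interface (an array of shape (num_actions, N)) is an assumption, and the constant c = 50 is taken from the schedule reported in Section 4.

```python
import numpy as np

def dltv_action(quantiles, t, c=50.0):
    """Action selection of Algorithm 1, line 2.

    quantiles: array of shape (num_actions, N), quantile estimates theta_j(x, a).
    t: current time step, used by the schedule c_t = c * sqrt(log t / t).
    """
    q_values = quantiles.mean(axis=1)                       # Q(x, a) = (1/N) sum_j theta_j
    N = quantiles.shape[1]
    upper = np.sort(quantiles, axis=1)[:, N // 2 - 1 :]     # quantiles from the median up
    median_q = upper[:, :1]
    sigma_plus_sq = np.sum((median_q - upper) ** 2, axis=1) / (2.0 * N)   # Eq. (7)
    c_t = c * np.sqrt(np.log(t) / t)                        # Eq. (5)
    return int(np.argmax(q_values + c_t * np.sqrt(sigma_plus_sq)))

# Toy usage with random numbers standing in for the target network's output.
rng = np.random.default_rng(0)
print(dltv_action(rng.normal(size=(6, 200)), t=1000))
```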

Figure 8. The naive exploration bonus and the decaying bonus used by DLTV in Pong.

Figure 8 presents the naive and decaying exploration bonus terms from DLTV of QR-DQN during training in Atari Pong. Comparison of Figure 8 to Figure 2b reveals the similarity in the behavior of the naive exploration bonus and the decaying exploration bonus. This shows what the raw variance looks like in an Atari 2600 game, and how suppressing the intrinsic uncertainty leads to a decaying bonus as illustrated in Figure 2b.

4. Atari 2600 Experiments

We evaluated DLTV on the set of 49 Atari games initially proposed by (Mnih et al., 2015). Algorithms were evaluated on 40 million frames³, 3 runs per game. The summary of the results is presented in Figure 10. Our approach achieved a 483% average gain in cumulative rewards⁴ over QR-DQN-1. Notably, the performance gain is obtained in hard games such as Venture, PrivateEye, Montezuma Revenge and Seaquest. The median human-normalized performance reported in Figure 6 shows a significant improvement of DLTV over QR-DQN-1. We present learning curves for all 49 games in the Appendix.

The architecture of the network follows (Dabney et al., 2017).

³ Equivalently, 10 million agent steps.
⁴ The cumulative reward is a suitable performance measure for our experiments, since none of the learning curves exhibit plummeting behaviour. Plummeting is characterized by an abrupt degradation of performance: the learning curve drops to the minimum and stays there indefinitely. A more detailed discussion of this point is presented in (Machado et al., 2017).
For our experiments we chose the Huber loss with κ = 1⁵, as in (Dabney et al., 2017), due to its smoothness compared to the L1 loss of QR-DQN-0 (smoothness is better suited for gradient descent methods). We followed (Dabney et al., 2017) closely in setting the hyperparameters, except for the learning rate of the Adam optimizer, which we set to α = 0.0001.

⁵ QR-DQN with κ = 1 is denoted as QR-DQN-1.

The most significant distinction of our DLTV is the way the exploration is performed. As opposed to QR-DQN, there is no epsilon-greedy exploration schedule in DLTV. The exploration is performed via the σ²_+ term only (line 2 of Algorithm 1).

An important hyperparameter introduced by DLTV is the schedule, i.e. the sequence of multipliers for σ²_+, {c_t}_t. In our experiments we used the schedule c_t = 50 √(log t / t).

We studied the effect of the schedule in the Atari 2600 game Venture. Figure 7 shows that a constant schedule for DLTV significantly degrades the performance. These empirical results show that the decaying schedule in DLTV is very important.

Figure 7. Online training curves for DLTV (with decaying schedule and with constant schedule) on the game of Venture.

5. CARLA Experiments

A particularly interesting application of the (distributional) RL approach is driving safety. There has been quite a convergence of interest in using RL for autonomous driving, e.g., see (Sakib et al., 2019; Fridman et al., 2018; Chen et al., 2018; Yao et al., 2017). In the classical RL setting the agent only cares about the mean. In distributional RL the estimate of the whole distribution allows for the construction of risk-sensitive policies. For that reason we further validate DLTV in the CARLA environment, which is a 3D self-driving simulator.

5.1. Sample efficiency

It should be noted that CARLA is a more visually complex environment than Atari 2600, since it is based on the modern Unreal Engine 4 with realistic physics and visual effects. For the purpose of this study we picked the task in which the ego car has to reach a goal position following predefined paths. In each episode the start and goal positions are sampled uniformly from a predefined set of locations (around 20). We conducted our experiments in Town 2. We simplified the reward signal provided in the original paper (Dosovitskiy et al., 2017). We assign a reward of −1.0 for any type of infraction and a small positive reward for travelling in the correct direction without any infractions, i.e. 0.001(distance_t − distance_{t+1}). The infractions we consider are: collisions with cars, collisions with humans, collisions with static objects, driving on the opposite lane and driving on a sidewalk. The continuous action space was discretized in a coarse-grained fashion. We defined 7 actions: 6 actions for going in different directions using fixed values for the steering angle and throttle, and a no-op action. The training curves are presented in Figure 9. DLTV significantly outperforms QR-DQN-1 and DQN. Interestingly, QR-DQN-1 performs on par with DQN.

Figure 9. Naive exploration bonus and decaying bonus (as used in DLTV) for CARLA. DLTV learns significantly faster than DQN and QR-DQN, achieving higher rewards for safe driving.

5.2. Driving Safety

A byproduct of distributional RL is the estimated distribution of Q(s, a). Access to this density allows for different approaches to control. For example, Morimura et al. (2012) derive risk-sensitive policies based on the quantiles rather than the mean. The reasoning behind such an approach is to view a quantile as a risk metric. For instance, one particularly interesting risk metric is Value-at-Risk (VaR), which has been in use for a few decades in the financial industry (Philippe, 2006). Artzner et al. (1999) define VaR_α(X) by Prob(X ≤ −VaR_α(X)) = 1 − α, that is, VaR_α(X) is the (1 − α)-th quantile of X.

It might be easier to understand the idea behind VaR in a financial setting. Consider two investments: the first will lose 1 dollar of its value or more with 10% probability (VaR_10% = 1), and the second will lose 2 dollars or more of its value with 10% probability (VaR_10% = 2). The second investment is riskier than the first one; that is, a risk-sensitive investor will pick the investment with the higher VaR. This same reasoning applies directly to the RL setting. Here, instead of investments we deal with actions: a risk-sensitive policy will pick the action that has the highest VaR.
For instance, Morimura et al. (2012) showed in the simple Cliff Walk environment that the policy maximizing low quantiles yields paths further away from the dangerous cliff.

Risk-sensitive policies are not only applicable to toy domains.
In fact, risk-sensitive policies are a very important research question in self-driving. In that respect, CARLA is a non-trivial domain where risk-sensitive policies can be thoroughly tested. In (Dosovitskiy et al., 2017) the authors introduce simple safety performance metrics such as the average distance travelled between infractions. In addition to this metric we also consider the collision impact. This metric allows one to differentiate policies with the same average distance between infractions: given that an impact is not avoidable, a good policy should minimize the impact.

We trained our agent using the DLTV approach, and during evaluation we used a risk-sensitive policy derived from VaR_90%(Q(s, a)) instead of the usual mean. Interestingly, this approach does not employ mean-centered RL at all. We benchmark this approach against an agent that uses the mean for control. The safety results for the risk-sensitive and the mean agents are presented in Table 1. It can be seen that the risk-sensitive agent significantly improves safety performance across almost all metrics, except for collisions with cars. However, the impact of colliding with cars is twice lower for the risk-sensitive agent.

Table 1. Safety performance in CARLA. We compare decision making using the mean and using a quantile, both according to the model trained by DLTV. Recall that DLTV learns a distribution of state-action values, represented by a set of quantile values. The middle column selects actions using a low quantile of the state-action value function, q_0.1, which is more conservative than the mean. In 1000 episodes, the total distance driven is 104.69 km, driving on the opposite lane every 4.55 km. Using the mean for action selection, the total distance driven is 98.66 km, driving on the opposite lane every 1.35 km. Across all measures, using the low quantile does better than using the mean for action selection, except that the collision rate with cars is higher while the collision impact is lower.

                                            VaR_90% (q_0.1)    Mean
Average distance between infractions (km)
  Opposite lane                             4.55               1.35
  Sidewalk                                  None               None
  Collision-static                          None               3.54
  Collision-car                             0.70               1.53
  Collision-pedestrian                      52.33              16.41
Average collision impact
  Collision-static                          None               509.81
  Collision-car                             497.22             1078.76
  Collision-pedestrian                      40.79              40.70
Distance, km                                104.69             98.66
# of evaluation episodes                    1000               1000

6. Related Work

Tang & Agrawal (2018) combined Bayesian parameter updates with distributional RL for efficient exploration. However, they demonstrated improvement only in simple domains. Zhang et al. (2019) generated risk-seeking and risk-averse policies via distributional RL for exploration, making use of both optimism and pessimism of intrinsic uncertainty. To the best of our knowledge, we are the first to use the parametric uncertainty in the estimated distributions learned by distributional RL algorithms for exploration.

For optimism in the face of uncertainty in the deep RL setting, Bellemare et al. (2016) and Ostrovski et al. (2017) exploited a generative model to enable pseudo-counts. Tang et al. (2017) combined task-specific features from an auto-encoder with similarity hashing to count high dimensional states. Chen et al. (2017) used a Q-ensemble to compute a variance-based exploration bonus. O'Donoghue et al. (2017) used the uncertainty Bellman equation to propagate the uncertainty through time steps. Most of those approaches bring in non-negligible computation overhead. In contrast, our DLTV achieves this optimism via distributional RL (QR-DQN in particular) and requires very little extra computation.

7. Conclusions

Recent advancements in distributional RL not only established new theoretically sound principles but also achieved state-of-the-art performance in challenging high dimensional environments like Atari 2600. We take a step further by studying the distributions learned by QR-DQN, and discover that the composite effect of intrinsic and parametric uncertainties is challenging for efficient exploration. In addition, the distribution estimated by distributional RL can be asymmetric. We proposed a novel decaying schedule to suppress the intrinsic uncertainty, and a truncated variance for calculating the exploration bonus, resulting in a new exploration strategy for QR-DQN. Empirical results showed that our method outperforms QR-DQN (with epsilon-greedy strategy) significantly in Atari 2600. Our method can be combined with other advancements in deep RL, e.g. Rainbow (Hessel et al., 2017), to yield yet better results.
References

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(3):397–422, 2002.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. arXiv:1804.08617, 2018.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. NIPS, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv:1707.06887, 2017.

Bellman, R. Dynamic Programming. Courier Corporation, 2013.

Chen, C., Qian, J., Yao, H., Luo, J., Zhang, H., and Liu, W. Towards comprehensive maneuver decisions for lane change using reinforcement learning. NIPS Workshop on Machine Learning for Intelligent Transportation Systems (MLITS), 2018.

Chen, R. Y., Sidor, S., Abbeel, P., and Schulman, J. UCB exploration via Q-ensembles. arXiv:1706.01502, 2017.

Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. arXiv:1710.10044, 2017.

Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. arXiv:1806.06923, 2018.

Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. CARLA: An open urban driving simulator. arXiv:1711.03938, 2017.

Fridman, L., Jenik, B., and Terwilliger, J. DeepTraffic: Driving fast through dense traffic with deep reinforcement learning. arXiv:1801.02805, 2018.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. Robust Statistics: The Approach Based on Influence Functions, volume 196. John Wiley & Sons, 2011.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. arXiv:1710.02298, 2017.

Huber, P. J. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

Huber, P. J. Robust statistics. International Encyclopedia of Statistical Science, 35(1):1248–1251, 2011.

Jaquette, S. C. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1(3):496–505, 1973.

Kaufmann, E., Cappé, O., and Garivier, A. On Bayesian upper confidence bounds for bandit problems. AISTATS, 2012.

Koenker, R. Quantile Regression. Econometric Society Monographs. Cambridge University Press, 2005.

Koenker, R. and Bassett Jr, G. Regression quantiles. Econometrica: Journal of the Econometric Society, 46(1):33–50, 1978.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv:1709.06009, 2017.

Mavrin, B., Yao, H., and Kong, L. Deep reinforcement learning with decorrelation. arXiv:1903.07765, 2019a.

Mavrin, B., Zhang, S., Yao, H., and Kong, L. Exploration in the face of parametric and intrinsic uncertainties. AAMAS, 2019b.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Parametric return density estimation for reinforcement learning. arXiv:1203.3497, 2012.

Mullooly, J. P. The variance of left-truncated continuous nonnegative distributions. The American Statistician, 42(3):208–210, 1988.

O'Donoghue, B., Osband, I., Munos, R., and Mnih, V. The uncertainty Bellman equation and exploration. arXiv:1709.05380, 2017.
Ostrovski, G., Bellemare, M. G., Oord, A. v. d., and Munos, R. Count-based exploration with neural density models. arXiv:1703.01310, 2017.

Philippe, J. Value at Risk: The New Benchmark for Managing Financial Risk, 3rd Ed. McGraw-Hill Education, 2006.

Rowland, M., Bellemare, M. G., Dabney, W., Munos, R., and Teh, Y. W. An analysis of categorical distributional reinforcement learning. arXiv:1802.08163, 2018.

Sakib, N., Yao, H., and Zhang, H. Reinforcing classical planning for adversary driving scenarios. arXiv:1903.08606, 2019.

Strehl, A. L. and Littman, M. L. A theoretical analysis of model-based interval estimation. ICML, 2005.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction (2nd Edition). MIT Press, 2018.

Szepesvári, C. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.

Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, O. X., Duan, Y., Schulman, J., DeTurck, F., and Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. NIPS, 2017.

Tang, Y. and Agrawal, S. Exploration by distributional reinforcement learning. arXiv:1805.01907, 2018.

Valdez, E. A. Tail conditional variance for elliptically contoured distributions. Belgian Actuarial Bulletin, 5(1):26–36, 2005.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Yao, H., Nosrati, M. S., and Rezaee, K. Monte-carlo tree search vs. model-predictive controller: A track-following example. NIPS Workshop on Machine Learning for Intelligent Transportation Systems (MLITS), 2017.

Zhang, S., Mavrin, B., Yao, H., Kong, L., and Liu, B. QUOTA: The quantile option architecture for reinforcement learning. AAAI, 2019.

Acknowledgement

The correct author list for this paper is Borislav Mavrin, Shangtong Zhang, Hengshuai Yao, Linglong Kong, Kaiwen Wu and Yaoliang Yu. Due to time pressure, Shangtong's name was omitted during submission. If you cite this paper, please use this correct author list. The mistake was fixed in the arXiv version of this paper.

A. Performance Profiling on Atari Games

Figure 10 shows the performance of DLTV and QR-DQN on 49 Atari games, measured by cumulative rewards (normalized Area Under the Curve).
Figure 10. Cumulative rewards performance comparison of DLTV and QR-DQN-1 on 49 Atari games. The bars represent the relative gain/loss of DLTV over QR-DQN-1; games are grouped into DLTV gains, ties, and losses (at or above QR-DQN-1 level vs. below).
