algorithm (TD3), an actor-critic algorithm which considers the interplay between function approximation error in both policy and value updates. We evaluate our algorithm on seven continuous control domains from OpenAI gym (Brockman et al., 2016), where we outperform the state of the art by a wide margin.

Given the recent concerns in reproducibility (Henderson et al., 2017), we run our experiments across a large number of seeds with fair evaluation metrics, perform ablation studies across each contribution, and open source both our code and learning curves (https://github.com/sfujim/TD3).

2. Related Work

Another approach is a reduction in the discount factor (Petrik & Scherrer, 2009), reducing the contribution of each error.

Our method builds on the Deterministic Policy Gradient algorithm (DPG) (Silver et al., 2014), an actor-critic method which uses a learned value estimate to train a deterministic policy. An extension of DPG to deep reinforcement learning, DDPG (Lillicrap et al., 2015), has been shown to produce state of the art results with an efficient number of iterations. Orthogonal to our approach, recent improvements to DDPG include distributed methods (Popov et al., 2017), along with multi-step returns and prioritized experience replay (Schaul et al., 2016; Horgan et al., 2018), and distributional methods (Bellemare et al., 2017; Barth-Maron et al., 2018).
Figure 1. Measuring overestimation bias in the value estimates of DDPG and our proposed method, Clipped Double Q-learning (CDQ), on MuJoCo environments over 1 million time steps. (a) Hopper-v1. (b) Walker2d-v1. Each panel plots the average value estimate against time steps (1e6).

A frozen target network is used to maintain a fixed objective y over multiple updates:

y = r + γQθ′(s′, a′),   a′ ∼ πφ′(s′),   (3)

where the actions are selected from a target actor network πφ′. The weights of a target network are either updated periodically to exactly match the weights of the current network, or by some proportion τ at each time step, θ′ ← τθ + (1 − τ)θ′. This update can be applied in an off-policy fashion, sampling random mini-batches of transitions from an experience replay buffer (Lin, 1992).
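To make the target network machinery concrete, the following is a minimal PyTorch-style sketch (not the paper's released code) of the soft update θ′ ← τθ + (1 − τ)θ′ and the fixed objective of Equation (3); the network shapes, dimensions, and variable names are illustrative placeholders.

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6  # placeholder dimensions

critic = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))
actor = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim), nn.Tanh())

# Target networks start as exact copies of the current networks.
critic_target, actor_target = copy.deepcopy(critic), copy.deepcopy(actor)

def soft_update(net, target_net, tau):
    """theta' <- tau * theta + (1 - tau) * theta'."""
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)

def fixed_objective(reward, next_state, gamma=0.99):
    """y = r + gamma * Q_theta'(s', a'), with a' ~ pi_phi'(s')  (Equation 3)."""
    with torch.no_grad():
        next_action = actor_target(next_state)
        next_q = critic_target(torch.cat([next_state, next_action], dim=1))
        return reward + gamma * next_q
```

After each mini-batch update of the current networks, calling `soft_update` on the actor and critic keeps the target networks slowly tracking them; the mini-batches themselves would be sampled uniformly from the replay buffer described above.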
4. Overestimation Bias

In Q-learning with discrete actions, the value estimate is updated with a greedy target y = r + γ maxa′ Q(s′, a′). However, if the target is susceptible to error ε, then the maximum over the value along with its error will generally be greater than the true maximum, Eε[maxa′ (Q(s′, a′) + ε)] ≥ maxa′ Q(s′, a′) (Thrun & Schwartz, 1993). As a result, even initially zero-mean error can cause value updates to result in a consistent overestimation bias, which is then propagated through the Bellman equation. This is problematic, as errors induced by function approximation are unavoidable.

While in the discrete action setting overestimation bias is an obvious artifact of the analytical maximization, the presence and effects of overestimation bias are less clear in an actor-critic setting, where the policy is updated via gradient descent. We begin by proving that the value estimate in deterministic policy gradients will be an overestimation under some basic assumptions in Section 4.1, and then propose a clipped variant of Double Q-learning in an actor-critic setting to reduce overestimation bias in Section 4.2.

4.1. Overestimation Bias in Actor-Critic

In actor-critic methods the policy is updated with respect to the value estimates of an approximate critic. In this section we assume the policy is updated using the deterministic policy gradient, and show that the update induces overestimation in the value estimate. Given current policy parameters φ, let φapprox define the parameters from the actor update induced by the maximization of the approximate critic Qθ(s, a), and φtrue the parameters from the hypothetical actor update with respect to the true underlying value function Qπ(s, a) (which is not known during learning):

φapprox = φ + (α/Z1) Es∼pπ [ ∇φ πφ(s) ∇a Qθ(s, a)|a=πφ(s) ],
φtrue = φ + (α/Z2) Es∼pπ [ ∇φ πφ(s) ∇a Qπ(s, a)|a=πφ(s) ],   (4)

where we assume Z1 and Z2 are chosen to normalize the gradient, i.e., such that Z⁻¹||E[·]|| = 1. Without normalized gradients, overestimation bias is still guaranteed to occur under slightly stricter conditions; we examine this case further in the supplementary material. We denote by πapprox and πtrue the policies with parameters φapprox and φtrue, respectively.

As the gradient direction is a local maximizer, there exists ε1 sufficiently small such that if α ≤ ε1, then the approximate value of πapprox will be bounded below by the approximate value of πtrue:

E[Qθ(s, πapprox(s))] ≥ E[Qθ(s, πtrue(s))].   (5)

Conversely, there exists ε2 sufficiently small such that if α ≤ ε2, then the true value of πapprox will be bounded above by the true value of πtrue:

E[Qπ(s, πtrue(s))] ≥ E[Qπ(s, πapprox(s))].   (6)

If in expectation the value estimate is at least as large as the true value with respect to φtrue, E[Qθ(s, πtrue(s))] ≥ E[Qπ(s, πtrue(s))], then Equations (5) and (6) imply that if α < min(ε1, ε2), then the value estimate will be an overestimation:

E[Qθ(s, πapprox(s))] ≥ E[Qπ(s, πapprox(s))].   (7)

Although this overestimation may be minimal with each update, the presence of error raises two concerns. Firstly, the overestimation may develop into a more significant bias over many updates if left unchecked. Secondly, an inaccurate value estimate may lead to poor policy updates. This is particularly problematic because a feedback loop is created, in which suboptimal actions might be highly rated by the suboptimal critic, reinforcing the suboptimal action in the next policy update.
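Before turning to experiments, here is a quick numerical illustration of the underlying mechanism (using arbitrary, made-up action values rather than an experiment from the paper): the snippet below estimates Eε[maxa′ (Q(s′, a′) + ε)] under zero-mean noise and compares it to maxa′ Q(s′, a′).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical action values for a single state (arbitrary numbers).
q_true = np.array([1.0, 0.5, 0.2, -0.3])

# Zero-mean Gaussian approximation error added to each action value.
noise = rng.normal(loc=0.0, scale=0.5, size=(100_000, q_true.size))

# The greedy target maximizes over the noisy estimates.
noisy_max = (q_true + noise).max(axis=1).mean()

print(f"max_a Q(s,a)            = {q_true.max():.3f}")
print(f"E[max_a (Q(s,a) + eps)] = {noisy_max:.3f}")  # consistently larger
```

Although the noise is unbiased for every individual action value, the maximization preferentially selects upward errors; the analysis above shows that the deterministic policy gradient update inherits the same effect in the continuous action setting.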
Does this theoretical overestimation occur in practice for state-of-the-art methods? We answer this question by plotting the value estimate of DDPG (Lillicrap et al., 2015) over time while it learns on the OpenAI gym environments Hopper-v1 and Walker2d-v1 (Brockman et al., 2016). In Figure 1, we graph the average value estimate over 10000 states and compare it to an estimate of the true value. The true value is estimated using the average discounted return over 1000 episodes following the current policy, starting from states sampled from the replay buffer. A very clear overestimation bias occurs from the learning procedure, which contrasts with the novel method that we describe in the following section, Clipped Double Q-learning, which greatly reduces overestimation by the critic.

Figure 2. Measuring overestimation bias in the value estimates of actor-critic variants of Double DQN (DDQN-AC) and Double Q-learning (DQ-AC) on MuJoCo environments over 1 million time steps. (a) Hopper-v1. (b) Walker2d-v1. Each panel plots the average value estimate against time steps (1e6).

4.2. Clipped Double Q-Learning for Actor-Critic

While several approaches to reducing overestimation bias have been proposed, we find them ineffective in an actor-critic setting. This section introduces a novel clipped variant of Double Q-learning (Van Hasselt, 2010), which can replace the critic in any actor-critic method.

In Double Q-learning, the greedy update is disentangled from the value function by maintaining two separate value estimates, each of which is used to update the other. If the value estimates are independent, they can be used to make unbiased estimates of the actions selected using the opposite value estimate. In Double DQN (Van Hasselt et al., 2016), the authors propose using the target network as one of the value estimates, and obtain a policy by greedy maximization of the current value network rather than the target network. In an actor-critic setting, an analogous update uses the current policy rather than the target policy in the learning target:

y = r + γQθ′(s′, πφ(s′)).   (8)

In practice, however, we found that with the slow-changing policy in actor-critic, the current and target networks were too similar to make an independent estimation, and offered little improvement. Instead, the original Double Q-learning formulation can be used, with a pair of actors (πφ1, πφ2) and critics (Qθ1, Qθ2), where πφ1 is optimized with respect to Qθ1 and πφ2 with respect to Qθ2:

y1 = r + γQθ2′(s′, πφ1(s′)),
y2 = r + γQθ1′(s′, πφ2(s′)).   (9)

We measure the overestimation bias in Figure 2, which demonstrates that the actor-critic Double DQN suffers from a similar overestimation as DDPG (shown in Figure 1). While Double Q-learning is more effective, it does not entirely eliminate the overestimation. We further show this reduction is not sufficient experimentally in Section 6.1.

As πφ1 optimizes with respect to Qθ1, using an independent estimate in the target update of Qθ1 would avoid the bias introduced by the policy update. However, the critics are not entirely independent, due to the use of the opposite critic in the learning targets, as well as the same replay buffer. As a result, for some states s we will have Qθ2(s, πφ1(s)) > Qθ1(s, πφ1(s)). This is problematic because Qθ1(s, πφ1(s)) will generally overestimate the true value, and in certain areas of the state space the overestimation will be further exaggerated. To address this problem, we propose to simply upper-bound the less biased value estimate Qθ2 by the biased estimate Qθ1. This results in taking the minimum between the two estimates, to give the target update of our Clipped Double Q-learning algorithm:

y1 = r + γ min_{i=1,2} Qθi′(s′, πφ1(s′)).   (10)

With Clipped Double Q-learning, the value target cannot introduce any additional overestimation over using the standard Q-learning target. While this update rule may induce an underestimation bias, this is far preferable to overestimation bias, as unlike overestimated actions, the value of underestimated actions will not be explicitly propagated through the policy update.

In implementation, computational costs can be reduced by using a single actor optimized with respect to Qθ1. We then use the same target y2 = y1 for Qθ2. If Qθ2 > Qθ1, then the update is identical to the standard update and induces no additional bias. If Qθ2 < Qθ1, this suggests overestimation has occurred and the value is reduced, similar to Double Q-learning. A proof of convergence in the finite MDP setting follows from this intuition; we provide formal details and justification in the supplementary material.

A secondary benefit is that, by treating the function approximation error as a random variable, we can see that the minimum operator should provide higher value to states with lower-variance estimation error, as the expected minimum of a set of random variables decreases as the variance of the random variables increases. This effect means that the minimization in Equation (10) will lead to a preference for states with low-variance value estimates, leading to safer policy updates with stable learning targets.
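As a concrete reading of Equation (10) with the single-actor simplification, here is a minimal PyTorch-style sketch of the clipped target; the function signature and the way the twin target critics and target actor are passed in are illustrative assumptions, not the authors' released implementation.

```python
import torch

def clipped_double_q_target(critic1_target, critic2_target, actor_target,
                            reward, next_state, gamma=0.99):
    """y1 = r + gamma * min_{i=1,2} Q_theta_i'(s', pi_phi1(s'))   (Equation 10)."""
    with torch.no_grad():
        next_action = actor_target(next_state)              # single target actor pi_phi1
        sa = torch.cat([next_state, next_action], dim=1)
        clipped_q = torch.min(critic1_target(sa), critic2_target(sa))
        return reward + gamma * clipped_q                   # shared target for both critics
```

Both critics then regress toward this single target y1 = y2, while only Qθ1 is used to update the policy, matching the computational simplification described above.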
5. Addressing Variance

While Section 4 deals with the contribution of variance to overestimation bias, we also argue that variance itself should be directly addressed. Besides its impact on overestimation bias, high-variance value estimates are known to reduce learning speed (Sutton & Barto, 1998) as well as hurt performance in practice. In this section we emphasize the importance of minimizing error at each update.

Figure 5. Learning curves for the OpenAI gym continuous control tasks. The shaded region represents half a standard deviation of the average evaluation over 10 trials. Some graphs are cropped to display the interesting regions.
Table 1. Max Average Return over 10 trials of 1 million time steps. Maximum value for each task is bolded. ± corresponds to a single
standard deviation over trials.
Environment TD3 DDPG Our DDPG PPO TRPO ACKTR SAC
HalfCheetah 9636.95 ± 859.065 3305.60 8577.29 1795.43 -15.57 1450.46 2347.19
Hopper 3564.07 ± 114.74 2020.46 1860.02 2164.70 2471.30 2428.39 2996.66
Walker2d 4682.82 ± 539.64 1843.85 3098.11 3317.69 2321.47 1216.70 1283.67
Ant 4372.44 ± 1000.33 1005.30 888.77 1083.20 -75.85 1821.94 655.35
Reacher -3.60 ± 0.56 -6.51 -4.01 -6.18 -111.43 -4.26 -4.44
InvPendulum 1000.00 ± 0.00 1000.00 1000.00 1000.00 985.40 1000.00 1000.00
InvDoublePendulum 9337.47 ± 14.96 9355.52 8369.95 8977.94 205.85 9081.92 8487.15
6.1. Evaluation

To evaluate our algorithm, we measure its performance on the suite of MuJoCo continuous control tasks (Todorov et al., 2012), interfaced through OpenAI Gym (Brockman et al., 2016) (Figure 4). To allow for reproducible comparison, we use the original set of tasks from Brockman et al. (2016) with no modifications to the environment or reward.

For our implementation of DDPG (Lillicrap et al., 2015), we use a two-layer feedforward neural network of 400 and 300 hidden nodes respectively, with rectified linear units (ReLU) between each layer for both the actor and critic, and a final tanh unit following the output of the actor. Unlike the original DDPG, the critic receives both the state and action as input to the first layer. Both network parameters are updated using Adam (Kingma & Ba, 2014) with a learning rate of 10⁻³. After each time step, the networks are trained with a mini-batch of 100 transitions, sampled uniformly from a replay buffer containing the entire history of the agent.

Target policy smoothing is implemented by adding ε ∼ N(0, 0.2) to the actions chosen by the target actor network, clipped to (−0.5, 0.5); delayed policy updates consist of only updating the actor and target critic network every d iterations, with d = 2. While a larger d would result in a larger benefit with respect to accumulating errors, for fair comparison the critics are only trained once per time step, and training the actor for too few iterations would cripple learning. Both target networks are updated with τ = 0.005.

To remove the dependency on the initial parameters of the policy, we use a purely exploratory policy for the first 10000 time steps of stable-length environments (HalfCheetah-v1 and Ant-v1) and the first 1000 time steps for the remaining environments. Afterwards, we use an off-policy exploration strategy, adding Gaussian noise N(0, 0.1) to each action. Unlike the original implementation of DDPG, we used uncorrelated noise for exploration, as we found noise drawn from the Ornstein-Uhlenbeck (Uhlenbeck & Ornstein, 1930) process offered no performance benefits.

Each task is run for 1 million time steps with evaluations every 5000 time steps, where each evaluation reports the average reward over 10 episodes with no exploration noise.
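To tie these settings together, below is a minimal PyTorch-style sketch of a single TD3-style training step with target policy smoothing, clipped double-Q targets, and delayed policy and target updates; the network objects, optimizers, batch layout, and the 0.99 discount are assumptions of the sketch rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

# Hyper-parameters described in Section 6.1 (the 0.99 discount is an assumed standard value).
GAMMA, TAU, POLICY_NOISE, NOISE_CLIP, POLICY_DELAY = 0.99, 0.005, 0.2, 0.5, 2

def td3_train_step(step, batch, actor, actor_target, critics, critic_targets,
                   actor_opt, critic_opt, max_action=1.0):
    """One update with target policy smoothing, clipped double-Q targets, and delayed updates.
    `batch` is a (state, action, reward, next_state) tuple of tensors sampled from a replay
    buffer; `critics`/`critic_targets` are (Q1, Q2) pairs of networks taking cat([s, a]);
    `critic_opt` covers the parameters of both critics. All names are illustrative."""
    state, action, reward, next_state = batch

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise added to the target action.
        noise = (torch.randn_like(action) * POLICY_NOISE).clamp(-NOISE_CLIP, NOISE_CLIP)
        next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)
        sa_next = torch.cat([next_state, next_action], dim=1)
        # Clipped Double Q-learning target (Equation 10).
        target_q = torch.min(critic_targets[0](sa_next), critic_targets[1](sa_next))
        y = reward + GAMMA * target_q

    # Both critics regress toward the same target and are trained once per time step.
    sa = torch.cat([state, action], dim=1)
    critic_loss = F.mse_loss(critics[0](sa), y) + F.mse_loss(critics[1](sa), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy and target-network updates, every POLICY_DELAY iterations (d = 2).
    if step % POLICY_DELAY == 0:
        actor_loss = -critics[0](torch.cat([state, actor(state)], dim=1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        for net, target in ((actor, actor_target), (critics[0], critic_targets[0]),
                            (critics[1], critic_targets[1])):
            for p, p_targ in zip(net.parameters(), target.parameters()):
                p_targ.data.mul_(1.0 - TAU).add_(TAU * p.data)
```

Exploration noise N(0, 0.1) would be added separately when actions are collected in the environment; it does not appear in the update itself.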
6.2. Ablation Studies

We perform ablation studies to understand the contribution of each individual component: Clipped Double Q-learning (Section 4.2), delayed policy updates (Section 5.2) and target policy smoothing (Section 5.3). We present our results in Table 2, in which we compare the performance of removing each component from TD3, along with our modifications to the architecture and hyper-parameters. Additional learning curves can be found in the supplementary material.

The significance of each component varies from task to task. While the addition of only a single component causes insignificant improvement in most cases, the addition of combinations performs at a much higher level. The full algorithm outperforms every other combination in most tasks. Although the actor is trained for only half the number of iterations, the inclusion of delayed policy updates generally improves performance, while reducing training time.

We additionally compare the effectiveness of the actor-critic variants of Double Q-learning (Van Hasselt, 2010) and Double DQN (Van Hasselt et al., 2016), denoted DQ-AC and DDQN-AC respectively, in Table 2. For fairness in comparison, these methods also benefit from delayed policy updates and target policy smoothing, and use our architecture and hyper-parameters. Both methods were shown to reduce overestimation bias less than Clipped Double Q-learning in Section 4. This is reflected empirically, as both methods result in insignificant improvements over TD3 - CDQ, with an exception in the Ant-v1 environment, which appears to benefit greatly from any overestimation reduction. As the inclusion of Clipped Double Q-learning into our full method outperforms both prior methods, this suggests that subduing the overestimations from the unbiased estimator is an effective measure to improve performance.

¹ See the supplementary material for hyper-parameters and a discussion on the discrepancy in the reported results of SAC.

7. Conclusion

Overestimation has been identified as a key problem in value-based methods. In this paper, we establish that overestimation bias is also problematic in actor-critic methods. We find that the common solutions for reducing overestimation bias in deep Q-learning with discrete actions are ineffective in an actor-critic setting, and develop a novel variant of Double Q-learning which limits possible overestimation. Our results demonstrate that mitigating overestimation can greatly improve the performance of modern algorithms.

Due to the connection between noise and overestimation, we examine the accumulation of errors from temporal difference learning. Our work investigates the importance of a standard technique in deep reinforcement learning, target networks, and examines their role in limiting errors from imprecise function approximation and stochastic optimization. Finally, we introduce a SARSA-style regularization technique which modifies the temporal difference target to bootstrap off similar state-action pairs.

Taken together, these improvements define our proposed approach, the Twin Delayed Deep Deterministic policy gradient algorithm (TD3), which greatly improves both the learning speed and performance of DDPG in a number of challenging tasks in the continuous control setting. Our algorithm exceeds the performance of numerous state of the art algorithms. As our modifications are simple to implement, they can be easily added to any other actor-critic algorithm.
References

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., TB, D., Muldal, A., Heess, N., and Lillicrap, T. Distributional policy gradients. International Conference on Learning Representations, 2018.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458, 2017.

Bellman, R. Dynamic Programming. Princeton University Press, 1957.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Dhariwal, P., Hesse, C., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202–211. AUAI Press, 2016.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. Distributed prioritized experience replay. International Conference on Learning Representations, 2018.

Lee, D., Defourny, B., and Powell, W. B. Bias-corrected Q-learning to control max-operator bias in Q-learning. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2013 IEEE Symposium on, pp. 93–99. IEEE, 2013.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Mannor, S. and Tsitsiklis, J. N. Mean-variance optimization in Markov decision processes. In International Conference on Machine Learning, pp. 177–184, 2011.

Mannor, S., Simester, D., Sun, P., and Tsitsiklis, J. N. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.

Nachum, O., Norouzi, M., Tucker, G., and Schuurmans, D. Smoothed action value functions for learning Gaussian policies. arXiv preprint arXiv:1803.02348, 2018.

O'Donoghue, B., Osband, I., Munos, R., and Mnih, V. The uncertainty Bellman equation and exploration. arXiv preprint arXiv:1709.05380, 2017.

Pendrith, M. D., Ryan, M. R., et al. Estimator variance in reinforcement learning: Theoretical problems and practical solutions. University of New South Wales, School of Computer Science and Engineering, 1997.

Petrik, M. and Scherrer, B. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pp. 1265–1272, 2009.

Popov, I., Heess, N., Lillicrap, T., Hafner, R., Barth-Maron, G., Vecerik, M., Lampe, T., Tassa, Y., Erez, T., and Riedmiller, M. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pp. 417–424, 2001.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395, 2014.

Van Seijen, H., Van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected sarsa. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL'09. IEEE Symposium on, pp. 177–184. IEEE, 2009.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.

Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5285–5294, 2017.