Human-Level Control Through Deep Reinforcement Learning
doi:10.1038/nature14236
which is the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time-step $t$, achievable by a behaviour policy $\pi = P(a \mid s)$, after making an observation (s) and taking an action (a) (see Methods)19.
Reinforcement learning is known to be unstable or even to diverge
when a nonlinear function approximator such as a neural network is
used to represent the action-value (also known as Q) function20. This
instability has several causes: the correlations present in the sequence
of observations, the fact that small updates to Q may significantly change
the policy and therefore change the data distribution, and the correlations
between the action-values (Q) and the target values $r + \gamma \max_{a'} Q(s', a')$.
We address these instabilities with a novel variant of Q-learning, which
uses two key ideas. First, we used a biologically inspired mechanism
termed experience replay21–23 that randomizes over the data, thereby
removing correlations in the observation sequence and smoothing over
changes in the data distribution (see below for details). Second, we used
an iterative update that adjusts the action-values (Q) towards target
values that are only periodically updated, thereby reducing correlations
with the target.
While other stable methods exist for training neural networks in the
reinforcement learning setting, such as neural fitted Q-iteration24, these
methods involve the repeated training of networks de novo on hundreds
of iterations. Consequently, these methods, unlike our algorithm, are
too inefficient to be used successfully with large neural networks. We
parameterize an approximate value function $Q(s,a;\theta_i)$ using the deep convolutional neural network shown in Fig. 1, in which $\theta_i$ are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay we store the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step t in a data set $D_t = \{e_1, \ldots, e_t\}$. During learning, we apply Q-learning updates, on samples (or minibatches) of experience $(s,a,r,s') \sim U(D)$, drawn uniformly at random from the pool of stored
samples. The Q-learning update at iteration i uses the following loss
function:
"
#
2
Li hi ~
s,a,r,s0 *UD
rzc max
Q(s0 ,a0 ; h{
i ){Qs,a; hi
0
a
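To make the loss concrete, the following minimal sketch (not the authors' implementation; the `q_online`/`q_target` callables and array shapes are illustrative assumptions) computes the sampled loss for a minibatch drawn from the replay memory:

```python
import numpy as np

def dqn_loss(q_online, q_target, batch, gamma=0.99):
    """Mean squared TD error over a uniformly sampled minibatch.

    q_online(states) -> array (B, n_actions): Q(s, .; theta_i)
    q_target(states) -> array (B, n_actions): Q(s, .; theta_i^-)
    batch: arrays (states, actions, rewards, next_states) of length B.
    Terminal transitions (handled in Algorithm 1, Methods) are omitted
    here for brevity.
    """
    s, a, r, s_next = batch
    # Target: r + gamma * max_a' Q(s', a'; theta_i^-), computed with the
    # periodically updated target parameters.
    y = r + gamma * q_target(s_next).max(axis=1)
    # Prediction: Q(s, a; theta_i) for the actions actually taken.
    q_sa = q_online(s)[np.arange(len(a)), a]
    return np.mean((y - q_sa) ** 2)
```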
Figure 1 | Schematic illustration of the convolutional neural network: convolutional layers (the figure symbolizes the sliding of each filter across the input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0,x)).
difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, as illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).
Figure 2 | Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with an ε-greedy policy (ε = 0.05) for 520 k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details. (Panels a and b plot average score per episode, and panels c and d plot average action value (Q), against training epochs.)
Figure 3 | Comparison of the DQN agent with the best reinforcement learning methods in the literature. The figure shows, for each of the 49 games, the normalized performance of DQN and of the best linear learner, with games ordered by DQN's normalized score (from Video Pinball at the top to Montezuma's Revenge at the bottom) and classified as at human-level or above, or below human-level. DQN outperforms competing methods (also see Extended Data Table 2) in almost all the games, and performs at a level that is broadly comparable with or superior to a professional human games tester (that is, operationalized as a level of 75% or above) in the majority of games. Audio output was disabled for both human players and agents. Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions.
perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs. Furthermore, we also show that the representations learned by DQN are able to generalize to data generated from policies other than its own: in simulations where we presented as input to the network game states experienced during human and agent play, we recorded the representations of the last hidden layer and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion). Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.
It is worth noting that the games in which DQN excels are extremely varied in their nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro).
The DQN agent
predicts high state values for both full (top right screenshots) and nearly
complete screens (bottom left screenshots) because it has learned that
completing a screen leads to a new screen full of enemy ships. Partially
completed screens (bottom screenshots) are assigned lower state values because
less immediate reward is available. The screens shown on the bottom right
and top left and middle are less perceptually similar than the other examples but
are still mapped to nearby representations and similar values because the
orange bunkers do not carry great significance near the end of a level. With
permission from Square Enix Limited.
realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during
offline periods21,22 (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through
interactions with the basal ganglia22. In the future, it will be important
to explore the potential use of biasing the content of experience replay
towards salient events, a phenomenon that characterizes empirically
observed hippocampal replay29, and relates to the notion of prioritized
sweeping30 in reinforcement learning. Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are
capable of learning to master a diverse array of challenging tasks.
Online Content Methods, along with any additional Extended Data display items
and Source Data, are available in the online version of the paper; references unique
to these sections appear only in the online paper.
Received 10 July 2014; accepted 16 January 2015.
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23. Lin, L.-J. Reinforcement learning for robots using neural networks. Technical
Report, DTIC Document (1993).
24. Riedmiller, M. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML 2005, 3720, 317–328 (Springer, 2005).
25. Van der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
26. Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw. 1–8 (2010).
27. Law, C.-T. & Gold, J. I. Reinforcement learning can account for associative and perceptual learning on a visual decision task. Nature Neurosci. 12, 655 (2009).
28. Sigala, N. & Logothetis, N. K. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002).
29. Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nature Neurosci. 15, 1439–1444 (2012).
30. Moore, A. & Atkeson, C. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103–130 (1993).
Supplementary Information is available in the online version of the paper.
Acknowledgements We thank G. Hinton, P. Dayan and M. Bowling for discussions,
A. Cain and J. Keene for work on the visuals, K. Keller and P. Rogers for help with the
visuals, G. Wayne for comments on an earlier version of the manuscript, and the rest of
the DeepMind team for their support, ideas and encouragement.
Author Contributions V.M., K.K., D.S., J.V., M.G.B., M.R., A.G., D.W., S.L. and D.H.
conceptualized the problem and the technical framework. V.M., K.K., A.A.R. and D.S.
developed and tested the algorithms. J.V., S.P., C.B., A.A.R., M.G.B., I.A., A.K.F., G.O. and
A.S. created the testing platform. K.K., H.K., S.L. and D.H. managed the project. K.K., D.K.,
D.H., V.M., D.S., A.G., A.A.R., J.V. and M.G.B. wrote the paper.
Author Information Reprints and permissions information is available at
www.nature.com/reprints. The authors declare no competing financial interests.
Readers are welcome to comment on the online version of the paper. Correspondence
and requests for materials should be addressed to K.K. ([email protected]) or
D.H. ([email protected]).
METHODS
Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation
and memory requirements. We apply a basic preprocessing step aimed at reducing
the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour
value over the frame being encoded and the previous frame. This was necessary to
remove flickering that is present in games where some objects appear only in even
frames while other objects appear only in odd frames, an artefact caused by the
limited number of sprites the Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 × 84. The function φ from algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
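As a rough illustration of this preprocessing, the sketch below (assuming NumPy and Pillow; the resampling filter and helper names are not specified by the paper and are chosen here for illustration) merges consecutive frames, extracts luminance and rescales to 84 × 84 before stacking the m = 4 most recent frames:

```python
import numpy as np
from PIL import Image

def preprocess(frame, prev_frame):
    """Approximate the preprocessing map phi for one frame: max over
    consecutive frames (to remove sprite flicker), luminance extraction
    and rescaling to 84 x 84."""
    # Element-wise maximum over the current and previous RGB frames.
    merged = np.maximum(frame, prev_frame)
    # Luminance (Y channel) from RGB, standard ITU-R BT.601 weights.
    luminance = (0.299 * merged[..., 0] +
                 0.587 * merged[..., 1] +
                 0.114 * merged[..., 2]).astype(np.uint8)
    # Rescale 210 x 160 -> 84 x 84 (default PIL filter assumed here).
    return np.asarray(Image.fromarray(luminance).resize((84, 84)))

def stack_frames(processed_frames, m=4):
    """Stack the m most recent preprocessed frames as the Q-network input."""
    return np.stack(processed_frames[-m:], axis=0)  # shape (m, 84, 84)
```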
Code availability. The source code can be accessed at https://sites.google.com/a/
deepmind.com/dqn for non-commercial uses only.
Model architecture. There are several possible ways of parameterizing Q using a
neural network. Because Q maps history-action pairs to scalar estimates of their
Q-value, the history and the action have been used as inputs to the neural network
by some previous approaches24,26. The main drawback of this type of architecture
is that a separate forward pass is required to compute the Q-value of each action,
resulting in a cost that scales linearly with the number of actions. We instead use an
architecture in which there is a separate output unit for each possible action, and
only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The
main advantage of this type of architecture is the ability to compute Q-values for all
possible actions in a given state with only a single forward pass through the network.
The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity31,32. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final hidden layer is fully connected and consists of 512 rectifier units. The output layer is a fully connected linear layer with a
single output for each valid action. The number of valid actions varied between 4
and 18 on the games we considered.
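A sketch of this architecture in PyTorch is given below for illustration (the framework and module names are choices made here, not the paper's implementation); the layer shapes follow the description above, giving a 64 × 7 × 7 feature map before the fully connected layers:

```python
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network matching the architecture described above:
    84x84x4 input, three convolutional layers, one fully connected hidden
    layer of 512 rectifier units, and one linear output per valid action."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 32 x 20 x 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64 x 9 x 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64 x 7 x 7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per valid action
        )

    def forward(self, x):
        # x: float tensor of shape (batch, 4, 84, 84)
        return self.head(self.features(x))
```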
Training details. We performed experiments on 49 Atari 2600 games where results
were available for all other comparable methods12,15. A different network was trained
on each game: the same network architecture, learning algorithm and hyperparameter settings (see Extended Data Table 1) were used across all games, showing that
our approach is robust enough to work on a variety of games while incorporating
only minimal prior knowledge (see below). While we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training
only. As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at −1, leaving 0 rewards unchanged.
Clipping the rewards in this manner limits the scale of the error derivatives and
makes it easier to use the same learning rate across multiple games. At the same time,
it could affect the performance of our agent since it cannot differentiate between
rewards of different magnitude. For games where there is a life counter, the Atari
2600 emulator also sends the number of lives left in the game, which is then used to
mark the end of an episode during training.
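A minimal sketch of the reward clipping and the training-time episode boundary (the helper names and the life-counter interface are assumptions made here for illustration):

```python
import numpy as np

def clip_reward(reward):
    """Clip positive rewards at 1 and negative rewards at -1; 0 is unchanged."""
    return float(np.clip(reward, -1.0, 1.0))

def training_episode_ended(lives_before, lives_after, game_over):
    """During training only, a lost life (reported by the emulator's life
    counter) marks the end of an episode."""
    return game_over or lives_after < lives_before
```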
In these experiments, we used the RMSProp (see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) algorithm with minibatches of size 32. The behaviour policy during training was ε-greedy with ε annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained
for a total of 50 million frames (that is, around 38 days of game experience in total)
and used a replay memory of 1 million most recent frames.
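The exploration schedule can be sketched as a simple linear interpolation (the helper name is illustrative):

```python
def epsilon_by_frame(frame_idx,
                     eps_start=1.0, eps_final=0.1, anneal_frames=1_000_000):
    """Linear annealing of the exploration rate: 1.0 -> 0.1 over the first
    million frames, then held fixed at 0.1."""
    fraction = min(frame_idx / anneal_frames, 1.0)
    return eps_start + fraction * (eps_final - eps_start)

# e.g. epsilon_by_frame(0) == 1.0, epsilon_by_frame(500_000) == 0.55,
# epsilon_by_frame(2_000_000) == 0.1
```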
Following previous approaches to playing Atari 2600 games, we also use a simple
frame-skipping technique15. More precisely, the agent sees and selects actions on
every kth frame instead of every frame, and its last action is repeated on skipped
frames. Because running the emulator forward for one step requires much less
computation than having the agent select an action, this technique allows the agent
to play roughly k times more games without significantly increasing the runtime.
We use k = 4 for all games.
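A sketch of such a frame-skipping wrapper follows (the `env.step` interface is an assumption, and summing the score change over skipped frames is one reasonable reading of the procedure):

```python
class FrameSkip:
    """Act only every k-th frame and repeat the last action on skipped
    frames; the score change over the skipped frames is accumulated."""

    def __init__(self, env, k=4):
        self.env = env  # assumed to expose step(action) -> (frame, reward, done)
        self.k = k

    def step(self, action):
        total_reward, done, frame = 0.0, False, None
        for _ in range(self.k):
            frame, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        return frame, total_reward, done
```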
The values of all the hyperparameters and optimization parameters were selected
by performing an informal search on the games Pong, Breakout, Seaquest, Space
Invaders and Beam Rider. We did not perform a systematic grid search owing to
the high computational cost. These parameters were then held fixed across all other
games. The values and descriptions of all hyperparameters are provided in Extended
Data Table 1.
Our experimental setup amounts to using the following minimal prior knowledge: that the input data consisted of visual images (motivating our use of a convolutional deep network), the game-specific score (with no modification), the number of actions, although not their correspondences (for example, specification of the up button), and the life count.
Evaluation procedure. The trained agents were evaluated by playing each game
30 times for up to 5 min each time with different initial random conditions ('no-op'; see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure is adopted to minimize the possibility of overfitting during evaluation. The
random agent served as a baseline comparison and chose a random action at 10 Hz
which is every sixth frame, repeating its last action on intervening frames. 10 Hz is
about the fastest that a human player can select the fire button, and setting the
random agent to this frequency avoids spurious baseline scores in a handful of the
games. We did also assess the performance of a random agent that selected an action
at 60 Hz (that is, every frame). This had a minimal effect: changing the normalized
DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy
Climber, Demon Attack, Krull and Robotank), and in all these games DQN outperformed the expert human by a considerable margin.
The professional human tester used the same emulator engine as the agents, and
played under controlled conditions. The human tester was not allowed to pause,
save or reload games. As in the original Atari 2600 environment, the emulator was
run at 60 Hz and the audio output was disabled: as such, the sensory input was
equated between human player and agents. The human performance is the average
reward achieved from around 20 episodes of each game lasting a maximum of 5 min
each, following around 2 h of practice playing each game.
Algorithm. We consider tasks in which an agent interacts with an environment,
in this case the Atari emulator, in a sequence of actions, observations and rewards.
At each time-step the agent selects an action $a_t$ from the set of legal game actions, $\mathcal{A} = \{1, \ldots, K\}$. The action is passed to the emulator and modifies its internal state and the game score. In general the environment may be stochastic. The emulator's internal state is not observed by the agent; instead the agent observes an image $x_t \in \mathbb{R}^d$ from the emulator, which is a vector of pixel values representing the current screen. In addition it receives a reward $r_t$ representing the change in game score.
Note that in general the game score may depend on the whole previous sequence of
actions and observations; feedback about an action may only be received after many
thousands of time-steps have elapsed.
Because the agent only observes the current screen, the task is partially observed33
and many emulator states are perceptually aliased (that is, it is impossible to fully
understand the current situation from only the current screen $x_t$). Therefore, sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, are input to the
algorithm, which then learns game strategies depending upon these sequences. All
sequences in the emulator are assumed to terminate in a finite number of timesteps. This formalism gives rise to a large but finite Markov decision process (MDP)
in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence st
as the state representation at time t.
The goal of the agent is to interact with the emulator by selecting actions in a way
that maximizes future rewards. We make the standard assumption that future rewards
are discounted by a factor of $\gamma$ per time-step ($\gamma$ was set to 0.99 throughout), and define the future discounted return at time t as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, in which T is the time-step at which the game terminates. The optimal action-value function, defined as the maximum expected return achievable after seeing some sequence s and then taking some action a, obeys the Bellman equation
$$Q^*(s,a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^*(s',a') \,\middle|\, s,a \right].$$
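For illustration, the discounted return for every time step of a finished episode can be computed with a single backward pass (a sketch, not part of the training algorithm itself):

```python
def discounted_return(rewards, gamma=0.99):
    """Future discounted return R_t = sum_{t'=t}^{T} gamma^(t'-t) r_{t'},
    computed for every time step t of an episode by a backward pass."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# e.g. discounted_return([0, 0, 1], gamma=0.99) == [0.9801, 0.99, 1.0]
```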
sometimes a nonlinear function approximator is used instead, such as a neural
network. We refer to a neural network function approximator with weights $\theta$ as a Q-network. A Q-network can be trained by adjusting the parameters $\theta_i$ at iteration i to reduce the mean-squared error in the Bellman equation, where the optimal target values $r + \gamma \max_{a'} Q^*(s',a')$ are substituted with approximate target values $y = r + \gamma \max_{a'} Q(s',a';\theta_i^-)$, using parameters $\theta_i^-$ from some previous iteration. This leads to a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration i,
$$L_i(\theta_i) = \mathbb{E}_{s,a,r}\!\left[ \left( \mathbb{E}_{s'}[\,y \mid s,a\,] - Q(s,a;\theta_i) \right)^{2} \right] = \mathbb{E}_{s,a,r,s'}\!\left[ \left( y - Q(s,a;\theta_i) \right)^{2} \right] + \mathbb{E}_{s,a,r}\!\left[ \mathrm{Var}_{s'}[\,y\,] \right].$$
Note that the targets depend on the network weights; this is in contrast with the
targets used for supervised learning, which are fixed before learning begins. At
each stage of optimization, we hold the parameters from the previous iteration $\theta_i^-$ fixed when optimizing the ith loss function $L_i(\theta_i)$, resulting in a sequence of well-defined optimization problems. The final term is the variance of the targets, which does not depend on the parameters $\theta_i$ that we are currently optimizing, and may
therefore be ignored. Differentiating the loss function with respect to the weights
we arrive at the following gradient:
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\!\left[ \left( r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i) \right].$$
Rather than computing the full expectations in the above gradient, it is often
computationally expedient to optimize the loss function by stochastic gradient
descent. The familiar Q-learning algorithm19 can be recovered in this framework
by updating the weights after every time step, replacing the expectations using
single samples, and setting $\theta_i^- = \theta_{i-1}$.
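For example, with a linear approximator $Q(s,a;\theta) = \theta^{\top}\phi(s,a)$ (a feature map chosen here purely for illustration), the recovered single-sample update looks like this:

```python
def q_learning_step(theta, phi, s, a, r, s_next, actions,
                    alpha=0.01, gamma=0.99):
    """Single-sample Q-learning update with a linear approximator,
    illustrating how standard Q-learning arises when the expectation is
    replaced by one sample and the target parameters are simply the
    previous parameters (theta is a NumPy weight vector, phi a feature map)."""
    q_next = max(theta @ phi(s_next, a2) for a2 in actions)
    td_error = r + gamma * q_next - theta @ phi(s, a)
    # Gradient of Q w.r.t. theta is phi(s, a) for a linear approximator.
    return theta + alpha * td_error * phi(s, a)
```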
Note that this algorithm is model-free: it solves the reinforcement learning task
directly using samples from the emulator, without explicitly estimating the reward and transition dynamics $P(r, s' \mid s, a)$. It is also off-policy: it learns about the greedy policy $a = \operatorname{argmax}_{a'} Q(s,a';\theta)$, while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy policy that follows the greedy policy with probability $1 - \varepsilon$ and selects a random action with probability $\varepsilon$.
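A minimal ε-greedy selection sketch:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Select a random action with probability epsilon, otherwise the
    greedy action argmax_a Q(s, a); q_values holds one Q-value per legal action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```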
Training algorithm for deep Q-networks. The full algorithm for training deep
Q-networks is presented in Algorithm 1. The agent selects and executes actions
according to an ε-greedy policy based on Q. Because using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed-length representation of histories produced by the function φ described
above. The algorithm modifies standard online Q-learning in two ways to make it
suitable for training large neural networks without diverging.
First, we use a technique known as experience replay23 in which we store the agent's experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a data set $D_t = \{e_1, \ldots, e_t\}$, pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, $(s, a, r, s') \sim U(D)$, drawn at random from the pool of stored samples. This approach
has several advantages over standard online Q-learning. First, each step of experience
is potentially used in many weight updates, which allows for greater data efficiency.
Second, learning directly from consecutive samples is inefficient, owing to the strong
correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters
are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch.
It is easy to see how unwanted feedback loops may arise and the parameters could get
stuck in a poor local minimum, or even diverge catastrophically20. By using experience
replay the behaviour distribution is averaged over many of its previous states,
smoothing out learning and avoiding oscillations or divergence in the parameters.
Note that when learning by experience replay, it is necessary to learn off-policy
(because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
In practice, our algorithm only stores the last N experience tuples in the replay
memory, and samples uniformly at random from D when performing updates. This
approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing
to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to
prioritized sweeping30.
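A sketch of such a finite replay memory with uniform sampling (capacity and batch size follow the values quoted above; the class itself is illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the last N transitions and samples minibatches uniformly at
    random, as in the replay scheme described above."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are overwritten

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        # Transpose the list of transitions into per-field tuples.
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```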
The second modification to online Q-learning aimed at further improving the
stability of our method with neural networks is to use a separate network for generating the targets $y_j$ in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network $\hat{Q}$ and use $\hat{Q}$ for generating the Q-learning targets $y_j$ for the following C updates to Q. This modification makes the
algorithm more stable compared to standard online Q-learning, where an update
that increases $Q(s_t,a_t)$ often also increases $Q(s_{t+1},a)$ for all a and hence also increases
the target yj, possibly leading to oscillations or divergence of the policy. Generating
the targets using an older set of parameters adds a delay between the time an update
to Q is made and the time the update affects the targets yj, making divergence or
oscillations much more unlikely.
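A sketch of the periodic cloning step (shown as a PyTorch-style state-dict copy; the update period C = 10,000 is used here as an illustrative default, see Extended Data Table 1):

```python
def maybe_update_target(online_net, target_net, step, C=10_000):
    """Every C parameter updates, clone the online network's weights into
    the target network used to compute the Q-learning targets y_j."""
    if step % C == 0:
        target_net.load_state_dict(online_net.state_dict())
```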
We also found it helpful to clip the error term from the update, $r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i)$, to be between −1 and 1. Because the absolute value loss function $|x|$ has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1)
interval. This form of error clipping further improved the stability of the algorithm.
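One common way to realize this error clipping is a Huber-style loss with unit threshold, which is quadratic inside (−1, 1) and absolute outside (equivalent, up to a constant factor, to the clipped squared error described above):

```python
def clipped_td_loss(td_error):
    """Quadratic for |error| <= 1, absolute (minus a constant) otherwise,
    so the gradient with respect to the error is clipped to [-1, 1]."""
    abs_err = abs(td_error)
    if abs_err <= 1.0:
        return 0.5 * td_error ** 2
    return abs_err - 0.5
```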
Algorithm 1: deep Q-learning with experience replay.
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights $\theta$
Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
For episode = 1, M do
  Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
  For t = 1, T do
    With probability $\varepsilon$ select a random action $a_t$
    otherwise select $a_t = \operatorname{argmax}_a Q(\phi(s_t), a; \theta)$
    Execute action $a_t$ in emulator and observe reward $r_t$ and image $x_{t+1}$
    Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
    Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in D
    Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from D
    Set $y_j = r_j$ if episode terminates at step $j+1$, otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$
    Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to the network parameters $\theta$
    Every C steps reset $\hat{Q} = Q$
  End For
End For
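For illustration, the pieces sketched above can be assembled into a compact training loop (a sketch under many assumptions: the `env` interface, the replay-start threshold, the earlier helper functions and the use of PyTorch are choices made here, not the authors' implementation):

```python
import copy
import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, memory, optimizer, n_frames,
              gamma=0.99, batch_size=32, C=10_000, start_learning=50_000):
    """Compact sketch of Algorithm 1: epsilon-greedy acting, uniform replay
    sampling, target-network bootstrapping and one gradient step per action.
    Relies on the helpers sketched earlier (epsilon_by_frame, epsilon_greedy,
    clip_reward, ReplayMemory, maybe_update_target); env.reset()/env.step()
    returning preprocessed 4x84x84 state stacks is an assumption."""
    target_net = copy.deepcopy(q_net)
    state = env.reset()
    for frame_idx in range(n_frames):
        # Epsilon-greedy action selection on the current state stack.
        eps = epsilon_by_frame(frame_idx)
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state[None], dtype=torch.float32))
        action = epsilon_greedy(q_values[0].tolist(), eps)

        next_state, reward, done = env.step(action)
        memory.store(state, action, clip_reward(reward), next_state, done)
        state = env.reset() if done else next_state

        if frame_idx < start_learning:
            continue  # fill the replay memory before learning starts

        # Uniformly sampled minibatch update against the target network.
        s, a, r, s2, d = (torch.as_tensor(np.asarray(x))
                          for x in memory.sample(batch_size))
        q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            y = r.float() + gamma * (1 - d.float()) * target_net(s2.float()).max(1).values
        loss = F.smooth_l1_loss(q_sa, y)  # Huber loss, i.e. the error clipping above

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        maybe_update_target(q_net, target_net, frame_idx, C=C)
    return q_net
```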
31. Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146–2153 (2009).
32. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010).
33. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1994).
The similar structure of the t-SNE embeddings of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned
by DQN do indeed generalize to data generated from policies other than its
own. The presence in the t-SNE embedding of overlapping clusters of points
corresponding to the network representation of states experienced during
human and agent play shows that the DQN agent also follows sequences of
states similar to those found in human play. Screenshots corresponding to
selected states are shown (human: orange border; DQN: blue border).
At time point 1, the predicted values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the up action stays high while the value of the down action falls to −0.9. This reflects the fact that pressing down would lead to the agent losing the ball and incurring a reward of −1. At time point 3,
the agent hits the ball by pressing up and the expected reward keeps increasing
until time point 4, when the ball reaches the left edge of the screen and the value
of all actions reflects that the agent is about to receive a reward of 1. Note that the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, it is not shown during the game). With permission from Atari
Interactive, Inc.
Extended Data Table 1 | List of hyperparameters and their values
The values of all the hyperparameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing
to the high computational cost, although it is conceivable that even better results could be obtained by systematically tuning the hyperparameter values.
Extended Data Table 2 | Comparison of games scores obtained by DQN agents with methods from the literature12,15 and a professional
human games tester
Best Linear Learner is the best result obtained by a linear function approximator on different types of hand-designed features12. Contingency (SARSA) agent figures are the results obtained in ref. 15. Note that the figures in the last column indicate the performance of DQN relative to the human games tester, expressed as a percentage, that is, 100 × (DQN score − random play score)/(human score − random play score).
Extended Data Table 3 | The effects of replay and separating the target Q-network
DQN agents were trained for 10 million frames using standard hyperparameters for all possible combinations of turning replay on or off, using or not using a separate target Q-network, and three different learning
rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min,
leading to higher scores on Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames was shorter (10 million frames) as compared to the main results presented in
Extended Data Table 2 (50 million frames).
Extended Data Table 4 | Comparison of DQN performance with linear function approximator
The performance of the DQN agent is compared with the performance of a linear function approximator
on the 5 validation games (that is, where a single linear layer was used instead of the convolutional
network, in combination with replay and separate target network). Agents were trained for 10 million
frames using standard hyperparameters, and three different learning rates. Each agent was evaluated
every 250,000 training frames for 135,000 validation frames and the highest average episode score is
reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on
Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames
was shorter (10 million frames) as compared to the main results presented in Extended Data Table 2
(50 million frames).