DQN: Human-level Control through Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al.
Google DeepMind
Presented by
Muhammed Kocabas
● Introduction
● What is Reinforcement Learning?
● Q-Learning
● Background and related work
● Methods
● Experiments
● Results
● Conclusion
● References
2
Introduction
● Previous reinforcement learning agents have only been successful in domains where useful features can be handcrafted, or in fully observed, low-dimensional state spaces.
● The authors used recent advances in training deep neural networks to develop the Deep Q-Network (DQN).
● Deep Q-Network:
– Learns successful policies
– End-to-end RL method
– In Atari 2600 games, it receives only the pixels and the game score as input, just as a human player would, and performs at a superhuman level in about half of the games.
● Comparison: human players, linear learners
3
What is reinforcement learning?
Types of Machine Learning
4
What is reinforcement learning?
5
What is reinforcement learning?
● No explicit training data set.
● Nature provides a reward for each of the learner's actions.
● At each time step:
– The learner is in a state and chooses an action.
– Nature responds with a new state and a reward.
– The learner learns from the reward and makes better decisions.
6
What is reinforcement learning?
● The main goal is to maximize the total reward Rt.
● Looking only at immediate rewards would not work well.
● We need to take "future" rewards into account.
● At time t, the total future reward is:
Rt = rt + rt+1 + … + rn
● We want to take the action that maximizes Rt.
● But we have to consider the fact that the environment is stochastic.
7
Discounted future rewards
● We can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more things may diverge.
● The main goal is to maximize the discounted reward Rt:
Rt = rt + γ rt+1 + γ^2 rt+2 + … + γ^(n-t) rn
● γ is the discount factor, between 0 and 1.
● The further into the future a reward lies, the less we take it into consideration.
● Simply (see the sketch below):
Rt = rt + γ (rt+1 + γ (rt+2 + …))
Rt = rt + γ Rt+1
8
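As an illustration of the recursion above, here is a minimal sketch (not from the slides) that computes the discounted return for every time step of an episode; the function name and example rewards are made up for the illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1} for every time step."""
    returns = [0.0] * len(rewards)
    future = 0.0
    # Walk backwards so each R_t can reuse the already-computed R_{t+1}.
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

# Example: three steps with rewards 1, 0, 2 and gamma = 0.99
print(discounted_returns([1.0, 0.0, 2.0]))  # [2.9602, 1.98, 2.0]
```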
Q-Function
● Q(s, a) is the maximum discounted future reward we can obtain when we perform action a in state s and continue optimally from that point on.
● The way to think about Q(st, at) is that it is "the best possible score at the end of the game after performing action a in state s".
● It is called the Q-function because it represents the "quality" of a certain action in a given state.
9
Policy & Bellman equation
● π represents the policy: the rule by which we choose an action in each state.
● The optimal Q-function obeys the Bellman equation:
Q*(s, a) = E_s´[ r + γ max_a´ Q*(s´, a´) ]
● The maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.
● The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update.
10
Q-Learning
● α in the algorithm is the learning rate, which controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account:
Q(s, a) ← Q(s, a) + α [ r + γ max_a´ Q(s´, a´) - Q(s, a) ]
● The max_a´ Q(s´, a´) that we use to update Q(s, a) is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function converges and represents the true Q-values (see the tabular sketch below).
11
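The following is a minimal tabular Q-learning sketch of the update rule above, assuming a hypothetical environment object `env` with Gym-like `reset()` and `step()` methods and a list of discrete `actions`; none of these names come from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> value, defaults to 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward if done else reward + gamma * best_next
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```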
Related Work
● Neural Fitted Q-Iteration (2005)
– Riedmiller, M. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML.
● Deep autoencoder NN in RL (2010)
– Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw.
Comparison Papers:
● The arcade learning environment: An evaluation platform for general agents (2013)
– Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.
● Investigating contingency awareness using Atari 2600 games (2012)
– Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell.
12
Neural Fitted Q-Iteration
● This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron.
● The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions.
● The method involves repeatedly training networks de novo over hundreds of iterations.
13
Deep Autoencoder NN in RL
● This paper tries to solve visual reinforcement learning problems, e.g. simple mazes.
● It proposes deep auto-encoder neural networks to obtain a low-dimensional feature space by extracting representative features from the states.
● It then uses kernel-based approximators, e.g. FQI (Fitted Q Iteration), to approximate the Q-function over the feature vectors produced by the deep auto-encoder.
15
Deep Q-Learning
● The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update:
Qi+1(s, a) = E[ r + γ max_a´ Qi(s´, a´) ]
● Such value iteration algorithms converge to the optimal action-value function, Qi → Q* as i → ∞.
● In practice, this basic approach is impractical, because the action-value function is estimated separately for each sequence, without any generalization.
● Instead, it is common to use a function approximator to estimate the action-value function, either a linear or a non-linear function approximator such as a neural network.
17
Deep Q-Learning
● The Q-function can be approximated using a neural network model.
18
● To efficiently evaluate max_a´ Q(s´, a´), the network takes only the state as input and has a separate output unit for the Q-value of each action, so the Q-values of all actions are obtained in a single forward pass.
19
Q-values can be any real values, which makes this a regression task that can be optimized with a simple squared-error loss:
Li(θi) = E(s,a,r,s´)~D [ ( r + γ max_a´ Q(s´, a´; θi⁻) - Q(s, a; θi) )^2 ]
where the transitions (s, a, r, s´) are drawn from the experience-replay memory D, the term r + γ max_a´ Q(s´, a´; θi⁻) is the target, and Q(s, a; θi) is the network's prediction.
● The targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins.
● The parameters θi⁻ from the previous iteration are held fixed when optimizing the ith loss function.
20
Gradient Update Rule
● Stochastic gradient descent was used to optimize the loss function. Differentiating the loss with respect to the weights gives the gradient:
∇θi Li(θi) = E(s,a,r,s´)~D [ ( r + γ max_a´ Q(s´, a´; θi⁻) - Q(s, a; θi) ) ∇θi Q(s, a; θi) ]
(a minimal sketch of one gradient step follows below).
21
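Below is a hedged sketch of one gradient step on the squared-error loss, assuming `q_net` and `target_net` are PyTorch `nn.Module` Q-networks and `batch` is a tuple of tensors (states, integer actions, rewards, next states, done flags) sampled from the replay memory; all of these names are illustrative, not from the slides.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Prediction: Q(s, a; theta_i) for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target: r + gamma * max_a' Q(s', a'; theta_i^-), held fixed.
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # stochastic gradient step on the loss
    return loss.item()
```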
Network Model
22
● The network is a classical convolutional neural network with three convolutional layers followed by two fully connected layers (a sketch follows below).
● There are no pooling layers:
– Pooling buys translation invariance.
– In games, the location of objects, e.g. the ball, i.e. the state, is crucial in determining the potential reward.
● The input is the last 4 frames, each 84x84:
– A single frame is not enough to infer the effect of the last action, e.g. ball speed or agent direction.
● Sequences of actions and observations are the input to the algorithm, which then learns game strategies depending on these sequences.
● The discount factor γ was set to 0.99 throughout.
● The outputs of the network are the Q-values for each possible action (up to 18 in Atari).
23
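A sketch of the convolutional Q-network in PyTorch; the layer sizes (32/64/64 filters, 8x8/4x4/3x3 kernels with strides 4/2/1, a 512-unit hidden layer) follow the architecture reported in the Nature paper, and the class and variable names are my own.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Three conv layers, two fully connected layers, one output per action,
    no pooling; input is a stack of 4 preprocessed 84x84 frames."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per possible action
        )

    def forward(self, x):
        return self.head(self.features(x / 255.0))  # scale pixels to [0, 1]

q_values = DQN(n_actions=18)(torch.zeros(1, 4, 84, 84))  # shape: [1, 18]
```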
Training Details
● 49 Atari 2600 games.
● A separate network was trained for each game.
● Reward clipping (see the sketch below):
– As the scale of scores varies greatly from game to game, all positive rewards were clipped at 1 and all negative rewards at -1, leaving 0 rewards unchanged.
– Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games.
– It could affect the performance of the agent, since it can no longer differentiate between rewards of different magnitudes.
● Minibatch size 32.
● The behaviour policy during training was ε-greedy.
● ε decreases over time from 1 to 0.1: in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
24
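A minimal sketch of the reward clipping described above: positive rewards become +1, negative rewards -1, and zero stays zero, i.e. the sign of the raw game reward.

```python
def clip_reward(reward):
    # (reward > 0) - (reward < 0) evaluates to +1, -1 or 0 in Python.
    return (reward > 0) - (reward < 0)

print(clip_reward(400))   # 1
print(clip_reward(-25))   # -1
print(clip_reward(0))     # 0
```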
Training Details
● Frame skipping (a sketch follows below):
– The agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on the skipped frames.
– This technique allows the agent to play roughly k times more games without significantly increasing the runtime.
25
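A sketch of frame skipping: the agent acts only on every kth frame and repeats the chosen action on the skipped frames. `env` (with Gym-like `reset()`/`step()`) and `choose_action` are hypothetical names introduced for the illustration.

```python
def run_episode(env, choose_action, k=4):
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = choose_action(obs)
        for _ in range(k):          # repeat the action on skipped frames
            obs, reward, done = env.step(action)
            total += reward
            if done:
                break
    return total
```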
●
Problem: Approximation of Qvalues using nonlinear functions is not very
stable.
●
Reason: Correlation present in the sequence of observations.
●
How: Small updates to Q may significantly change the policy and data
distribution.
●
Solution: Experience replay to remove correlations in the observation.
Seperate target network to remove correlations with the target.
26
Experience Replay
● During gameplay, all experiences <s, a, r, s´> are stored in a replay memory.
● When training the network, random minibatches from the replay memory are used instead of the most recent transition.
● This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum.
● Experience replay makes the training task more similar to usual supervised learning, which simplifies debugging and testing of the algorithm.
● One could actually collect all these experiences from human gameplay and then train the network on them.
27
Experience Replay
● Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
● Learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples.
● Randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
● By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
● The algorithm stores only the last N experience tuples in the replay memory D, and samples uniformly at random from D when performing updates.
28
Experience Replay
● To remove correlations, build a dataset from the agent's own experience (a minimal replay-memory sketch follows below).
● Sample experiences from the dataset and apply the update.
29
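A minimal replay-memory sketch: it holds only the last N transitions <s, a, r, s´, done> and samples uniform random minibatches; the class and method names are my own.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        # Transpose the list of tuples into (states, actions, rewards, next_states, dones).
        return tuple(map(list, zip(*batch)))

    def __len__(self):
        return len(self.buffer)
```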
Experience Replay Analogy
30
Separate Target Network
● To improve the stability of the method with neural networks, a separate network is used for generating the targets in the Q-learning update.
● Every C updates, the network Q is cloned to obtain a target network Q´, and Q´ is used for generating the Q-learning targets for the following C updates to Q (see the sketch below).
● This modification makes the algorithm more stable compared to standard online Q-learning.
● It reduces oscillations or divergence of the policy.
● Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets, making divergence or oscillations much more unlikely.
31
Separate Target Network
[Diagram: the input is fed to both the target network Q´ and the prediction network Q.]
32
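A sketch of the periodic clone from Q to Q´, assuming the networks are PyTorch `nn.Module` instances; the function names are illustrative.

```python
import copy

def make_target(q_net):
    # The target network starts as a deep copy of the prediction network.
    target_net = copy.deepcopy(q_net)
    target_net.eval()  # used only for generating targets, never updated by gradients
    return target_net

def maybe_sync_target(q_net, target_net, step, C=10_000):
    # Every C updates, clone Q into Q'.
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```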
Exploration-Exploitation
● First observe that when a Q-table or Q-network is initialized randomly, its predictions are initially random as well.
● If we pick the action with the highest Q-value, the action will be random and the agent performs crude "exploration".
● As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases.
● So one could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds.
● A simple and effective fix for this problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value (see the sketch below).
● In their system DeepMind actually decreases ε over time from 1 to 0.1: in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
33
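A sketch of ε-greedy action selection with a linear annealing schedule from 1.0 down to 0.1, as described above; the function names and the number of annealing steps are illustrative assumptions.

```python
import random

def epsilon_at(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, n_actions):
    if random.random() < epsilon_at(step):
        return random.randrange(n_actions)                        # explore
    return max(range(n_actions), key=lambda a: q_values[a])       # exploit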
Algorithm
34
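To tie the previous pieces together, here is a hedged sketch of the overall training loop, combining ε-greedy acting, reward clipping, the replay memory, the loss update and the periodic target-network sync from the earlier sketches. `env`, `preprocess`, `q_values_of` and `to_tensors`, and the helpers reused from above, are all hypothetical names; this is not the paper's code.

```python
def train(env, q_net, target_net, optimizer, memory, total_steps=10_000_000):
    state = preprocess(env.reset())              # stack of 4 preprocessed frames
    for step in range(total_steps):
        q_values = q_values_of(q_net, state)     # one value per action
        action = select_action(q_values, step, n_actions=env.n_actions)
        frame, reward, done = env.step(action)
        next_state = preprocess(frame)
        memory.push(state, action, clip_reward(reward), next_state, done)
        if len(memory) >= 50_000:                # assumed warm-up before learning
            batch = to_tensors(memory.sample(32))
            dqn_update(q_net, target_net, optimizer, batch)
        maybe_sync_target(q_net, target_net, step)
        state = preprocess(env.reset()) if done else next_state
```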
Experiments
● Atari 2600 platform, 49 games.
● Same network architecture for all tasks.
● Input: visual images & the number of actions.
● Results compared with:
– Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.
– Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell.
– Human players.
35
Results
● "Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches."
● "Our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games)."
36
● In certain games, DQN is able to discover a relatively long-term strategy.
– Breakout: first dig a tunnel around the side of the wall, then send the ball through it to the back.
39
● Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents, including DQN.
– Example: Montezuma's Revenge.
40
Conclusion
● A single architecture can successfully learn control policies in a range of different environments:
– Minimal prior knowledge,
– Only pixels and the game score as input,
– Same algorithm,
– Same architecture,
– Same hyperparameters,
– "Just like a human player."
45
Thank you!
46