DQN: Human-level Control through Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al.
Google DeepMind
Presented by
Muhammed Kocabas
● Introduction
● What is Reinforcement Learning?
● Q-Learning
● Background and related work
● Methods
● Experiments
● Results
● Conclusion
● References
2
Introduction
● Previous reinforcement learning agents have only been successful in domains where useful features can be handcrafted, or in fully observed, low-dimensional state spaces.
● The authors used recent advances in training deep neural networks to develop the Deep Q-Network (DQN).
● Deep Q-Network:
– Learns successful policies
– End-to-end RL method
– In Atari 2600 games, it receives only the pixels and the game score as input, just as a human player would, and performs at a superhuman level in about half of the games.
● Comparison: human players, linear learners
3
What is reinforcement learning?
Types of Machine Learning
4
What is reinforcement learning?
5
What is reinforcement learning?
● No explicit training data set.
● Nature provides a reward for each of the learner's actions.
● At each time step:
– The learner is in a state and chooses an action.
– Nature responds with a new state and a reward.
– The learner learns from the reward and makes better decisions.
6
What is reinforcement learning?
● The main goal is to maximize the total reward Rt.
● Looking only at immediate rewards would not work well.
● We need to take "future" rewards into account.
● At time t, the total future reward is:
Rt = rt + rt+1 + … + rn
● We want to take the action that maximizes Rt.
● But we have to consider the fact that the environment is stochastic.
7
Discounted future rewards
● We can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more things may diverge.
● The main goal is to maximize the discounted reward Rt:
Rt = rt + γ rt+1 + γ^2 rt+2 + … + γ^(n-t) rn
● γ is the discount factor, between 0 and 1.
● The further into the future a reward lies, the less we take it into consideration.
● Simply (see the sketch below):
Rt = rt + γ (rt+1 + γ (rt+2 + …))
Rt = rt + γ Rt+1
8
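As an illustration of the recursion above, here is a minimal sketch (not from the slides) that computes the discounted return for every time step of an episode; the function name and example rewards are made up for the illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1} for every time step."""
    returns = [0.0] * len(rewards)
    future = 0.0
    # Walk backwards so each R_t can reuse the already-computed R_{t+1}.
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

# Example: three steps with rewards 1, 0, 2 and gamma = 0.99
print(discounted_returns([1.0, 0.0, 2.0]))  # [2.9602, 1.98, 2.0]
```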
Q-Function
● Q(s, a) is the maximum discounted future reward we can obtain when we perform action a in state s and continue optimally from that point on.
● The way to think about Q(st, at) is that it is "the best possible score at the end of the game after performing action a in state s".
● It is called the Q-function because it represents the "quality" of a certain action in a given state.
9
Policy & Bellman equation
● π represents the policy: the rule by which we choose an action in each state.
● The optimal Q-function obeys the Bellman equation:
Q*(s, a) = E_s´[ r + γ max_a´ Q*(s´, a´) ]
● The maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.
● The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update.
10
Q-Learning
● α in the algorithm is the learning rate, which controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account:
Q(s, a) ← Q(s, a) + α [ r + γ max_a´ Q(s´, a´) - Q(s, a) ]
● The max_a´ Q(s´, a´) that we use to update Q(s, a) is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function converges and represents the true Q-values (see the tabular sketch below).
11
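The following is a minimal tabular Q-learning sketch of the update rule above, assuming a hypothetical environment object `env` with Gym-like `reset()` and `step()` methods and a list of discrete `actions`; none of these names come from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> value, defaults to 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward if done else reward + gamma * best_next
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```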
Related Work
● Neural Fitted Q-Iteration (2005)
– Riedmiller, M. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML.
● Deep autoencoder NN in RL (2010)
– Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw.
Comparison Papers:
● The arcade learning environment: An evaluation platform for general agents (2013)
– Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.
● Investigating contingency awareness using Atari 2600 games (2012)
– Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell.
12
Neural Fitted Q-Iteration
● This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron.
● The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions.
● The method involves repeatedly training networks de novo over hundreds of iterations.
13
Deep Autoencoder NN in RL
● This paper tries to solve visual reinforcement learning problems, e.g. simple mazes.
● It proposes deep auto-encoder neural networks to obtain a low-dimensional feature space by extracting representative features from the states.
● It then uses kernel-based approximators, e.g. FQI (Fitted Q Iteration), to approximate the Q-function over the feature vectors produced by the deep auto-encoder.
15
Deep Q-Learning
● The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update:
Qi+1(s, a) = E[ r + γ max_a´ Qi(s´, a´) ]
● Such value iteration algorithms converge to the optimal action-value function, Qi → Q* as i → ∞.
● In practice, this basic approach is impractical, because the action-value function is estimated separately for each sequence, without any generalization.
● Instead, it is common to use a function approximator to estimate the action-value function, either a linear or a non-linear function approximator such as a neural network.
17
Deep Q-Learning
● The Q-function can be approximated using a neural network model.
18
● To efficiently evaluate max_a´ Q(s´, a´), the network takes only the state as input and has a separate output unit for the Q-value of each action, so the Q-values of all actions are obtained in a single forward pass.
19
Q-values can be any real values, which makes this a regression task that can be optimized with a simple squared-error loss:
Li(θi) = E(s,a,r,s´)~D [ ( r + γ max_a´ Q(s´, a´; θi⁻) - Q(s, a; θi) )^2 ]
where the transitions (s, a, r, s´) are drawn from the experience-replay memory D, the term r + γ max_a´ Q(s´, a´; θi⁻) is the target, and Q(s, a; θi) is the network's prediction.
● The targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins.
● The parameters θi⁻ from the previous iteration are held fixed when optimizing the ith loss function.
20
Gradient Update Rule
● Stochastic gradient descent was used to optimize the loss function. Differentiating the loss with respect to the weights gives the gradient:
∇θi Li(θi) = E(s,a,r,s´)~D [ ( r + γ max_a´ Q(s´, a´; θi⁻) - Q(s, a; θi) ) ∇θi Q(s, a; θi) ]
(a minimal sketch of one gradient step follows below).
21
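Below is a hedged sketch of one gradient step on the squared-error loss, assuming `q_net` and `target_net` are PyTorch `nn.Module` Q-networks and `batch` is a tuple of tensors (states, integer actions, rewards, next states, done flags) sampled from the replay memory; all of these names are illustrative, not from the slides.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Prediction: Q(s, a; theta_i) for the actions actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target: r + gamma * max_a' Q(s', a'; theta_i^-), held fixed.
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1.0 - dones)
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # stochastic gradient step on the loss
    return loss.item()
```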
Network Model
22
● The network is a classical convolutional neural network with three convolutional layers followed by two fully connected layers (a sketch follows below).
● There are no pooling layers:
– Pooling buys translation invariance.
– In games, the location of objects, e.g. the ball, i.e. the state, is crucial in determining the potential reward.
● The input is the last 4 frames, each 84x84:
– A single frame is not enough to infer the effect of the last action, e.g. ball speed or agent direction.
● Sequences of actions and observations are the input to the algorithm, which then learns game strategies depending on these sequences.
● The discount factor γ was set to 0.99 throughout.
● The outputs of the network are the Q-values for each possible action (up to 18 in Atari).
23
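A sketch of the convolutional Q-network in PyTorch; the layer sizes (32/64/64 filters, 8x8/4x4/3x3 kernels with strides 4/2/1, a 512-unit hidden layer) follow the architecture reported in the Nature paper, and the class and variable names are my own.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Three conv layers, two fully connected layers, one output per action,
    no pooling; input is a stack of 4 preprocessed 84x84 frames."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per possible action
        )

    def forward(self, x):
        return self.head(self.features(x / 255.0))  # scale pixels to [0, 1]

q_values = DQN(n_actions=18)(torch.zeros(1, 4, 84, 84))  # shape: [1, 18]
```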
Training Details
● 49 Atari 2600 games.
● A separate network was trained for each game.
● Reward clipping (see the sketch below):
– As the scale of scores varies greatly from game to game, all positive rewards were clipped at 1 and all negative rewards at -1, leaving 0 rewards unchanged.
– Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games.
– It could affect the performance of the agent, since it can no longer differentiate between rewards of different magnitudes.
● Minibatch size 32.
● The behaviour policy during training was ε-greedy.
● ε decreases over time from 1 to 0.1: in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
24
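A minimal sketch of the reward clipping described above: positive rewards become +1, negative rewards -1, and zero stays zero, i.e. the sign of the raw game reward.

```python
def clip_reward(reward):
    # (reward > 0) - (reward < 0) evaluates to +1, -1 or 0 in Python.
    return (reward > 0) - (reward < 0)

print(clip_reward(400))   # 1
print(clip_reward(-25))   # -1
print(clip_reward(0))     # 0
```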
Training Details
● Frame skipping (a sketch follows below):
– The agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on the skipped frames.
– This technique allows the agent to play roughly k times more games without significantly increasing the runtime.
25
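A sketch of frame skipping: the agent acts only on every kth frame and repeats the chosen action on the skipped frames. `env` (with Gym-like `reset()`/`step()`) and `choose_action` are hypothetical names introduced for the illustration.

```python
def run_episode(env, choose_action, k=4):
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = choose_action(obs)
        for _ in range(k):          # repeat the action on skipped frames
            obs, reward, done = env.step(action)
            total += reward
            if done:
                break
    return total
```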
●
Problem: Approximation of Qvalues using nonlinear functions is not very
stable.
●
Reason: Correlation present in the sequence of observations.
●
How: Small updates to Q may significantly change the policy and data
distribution.
●
Solution: Experience replay to remove correlations in the observation.
Seperate target network to remove correlations with the target.
26
Experience Replay
● During gameplay, all experiences <s, a, r, s´> are stored in a replay memory.
● When training the network, random minibatches from the replay memory are used instead of the most recent transition.
● This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum.
● Experience replay makes the training task more similar to usual supervised learning, which simplifies debugging and testing of the algorithm.
● One could actually collect all these experiences from human gameplay and then train the network on them.
27
Experience Replay
● Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
● Learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples.
● Randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
● By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
● The algorithm stores only the last N experience tuples in the replay memory D, and samples uniformly at random from D when performing updates.
28
Experience Replay
● To remove correlations, build a dataset from the agent's own experience (a minimal replay-memory sketch follows below).
● Sample experiences from the dataset and apply the update.
29
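A minimal replay-memory sketch: it holds only the last N transitions <s, a, r, s´, done> and samples uniform random minibatches; the class and method names are my own.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        # Transpose the list of tuples into (states, actions, rewards, next_states, dones).
        return tuple(map(list, zip(*batch)))

    def __len__(self):
        return len(self.buffer)
```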
Experience Replay Analogy
30
Separate Target Network
● To improve the stability of the method with neural networks, a separate network is used for generating the targets in the Q-learning update.
● Every C updates, the network Q is cloned to obtain a target network Q´, and Q´ is used for generating the Q-learning targets for the following C updates to Q (see the sketch below).
● This modification makes the algorithm more stable compared to standard online Q-learning.
● It reduces oscillations or divergence of the policy.
● Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets, making divergence or oscillations much more unlikely.
31
Separate Target Network
[Diagram: the input is fed to both the target network Q´ and the prediction network Q.]
32
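A sketch of the periodic clone from Q to Q´, assuming the networks are PyTorch `nn.Module` instances; the function names are illustrative.

```python
import copy

def make_target(q_net):
    # The target network starts as a deep copy of the prediction network.
    target_net = copy.deepcopy(q_net)
    target_net.eval()  # used only for generating targets, never updated by gradients
    return target_net

def maybe_sync_target(q_net, target_net, step, C=10_000):
    # Every C updates, clone Q into Q'.
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```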
Exploration-Exploitation
● First observe that when a Q-table or Q-network is initialized randomly, its predictions are initially random as well.
● If we pick the action with the highest Q-value, the action will be random and the agent performs crude "exploration".
● As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases.
● So one could say that Q-learning incorporates exploration as part of the algorithm. But this exploration is "greedy": it settles on the first effective strategy it finds.
● A simple and effective fix for this problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value (see the sketch below).
● In their system DeepMind actually decreases ε over time from 1 to 0.1: in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.
33
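A sketch of ε-greedy action selection with a linear annealing schedule from 1.0 down to 0.1, as described above; the function names and the number of annealing steps are illustrative assumptions.

```python
import random

def epsilon_at(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, n_actions):
    if random.random() < epsilon_at(step):
        return random.randrange(n_actions)                        # explore
    return max(range(n_actions), key=lambda a: q_values[a])       # exploit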
Algorithm
34
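To tie the previous pieces together, here is a hedged sketch of the overall training loop, combining ε-greedy acting, reward clipping, the replay memory, the loss update and the periodic target-network sync from the earlier sketches. `env`, `preprocess`, `q_values_of` and `to_tensors`, and the helpers reused from above, are all hypothetical names; this is not the paper's code.

```python
def train(env, q_net, target_net, optimizer, memory, total_steps=10_000_000):
    state = preprocess(env.reset())              # stack of 4 preprocessed frames
    for step in range(total_steps):
        q_values = q_values_of(q_net, state)     # one value per action
        action = select_action(q_values, step, n_actions=env.n_actions)
        frame, reward, done = env.step(action)
        next_state = preprocess(frame)
        memory.push(state, action, clip_reward(reward), next_state, done)
        if len(memory) >= 50_000:                # assumed warm-up before learning
            batch = to_tensors(memory.sample(32))
            dqn_update(q_net, target_net, optimizer, batch)
        maybe_sync_target(q_net, target_net, step)
        state = preprocess(env.reset()) if done else next_state
```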
Experiments
● Atari 2600 platform, 49 games.
● Same network architecture for all tasks.
● Input: visual images & the number of actions.
● Results compared with:
– Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.
– Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell.
– Human players.
35
Results
● "Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches."
● "Our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games)."
36
● In certain games, DQN is able to discover a relatively long-term strategy.
– Breakout: first dig a tunnel around the side of the wall, then send the ball through it to the back.
39
● Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents, including DQN.
– Example: Montezuma's Revenge.
40
Conclusion
● A single architecture can successfully learn control policies in a range of different environments:
– Minimal prior knowledge,
– Only pixels and the game score as input,
– Same algorithm,
– Same architecture,
– Same hyperparameters,
– "Just like a human player."
45
Thank you!
46