Lecture 6: CNNs and Deep Q Learning
Emma Brunskill
CS234 Reinforcement Learning.
Winter 2019
With many slides for DQN from David Silver and Ruslan Salakhutdinov and some
vision slides from Gianni Di Caro and images from Stanford CS231n,
http://cs231n.github.io/convolutional-networks/
Table of Contents
1 Convolutional Neural Nets (CNNs)
2 Deep Q Learning
Class Structure
Last time: Value function approximation
This time: RL with function approximation, deep RL
Generalization
Want to be able to use reinforcement learning to tackle self-driving
cars, Atari, consumer marketing, healthcare, education, . . .
Most of these domains have enormous state and/or action spaces
Requires representations (of models / state-action values / values /
policies) that can generalize across states and/or actions
Represent a (state-action/state) value function with a parameterized
function instead of a table
Figure: a parameterized function with weights w maps s to V̂(s; w), or s and a to Q̂(s, a; w)
Recall: Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function V π (s) and its approximation V̂ π (s; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as
J(w) = E_π[(V^π(s) − V̂^π(s; w))^2]
Can use gradient descent to find a local minimum
Δw = −(1/2) α ∇_w J(w)
Stochastic gradient descent (SGD) samples the gradient:
−(1/2) ∇_w J(w) = E_π[(V^π(s) − V̂^π(s; w)) ∇_w V̂^π(s; w)]
Δw = α(V^π(s) − V̂^π(s; w)) ∇_w V̂^π(s; w)
In expectation, the SGD update equals the full gradient update
Last Time: Linear Value Function Approximation for
Prediction With An Oracle
Represent a value function (or state-action value function) for a
particular policy with a weighted linear combination of features
V̂(s; w) = Σ_{j=1}^n x_j(s) w_j = x(s)^T w
Objective function is J(w) = E_π[(V^π(s) − V̂^π(s; w))^2]
Recall weight update is Δw = −(1/2) α ∇_w J(w)
For MC policy evaluation
Δw = α(G_t − x(s_t)^T w) x(s_t)
For TD policy evaluation
Δw = α(r_t + γ x(s_{t+1})^T w − x(s_t)^T w) x(s_t)
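As a concrete sketch, the two updates in NumPy (the feature vectors x(s) are assumed given; names are illustrative, not from the lecture):

```python
import numpy as np

def mc_update(w, x_s, G_t, alpha):
    # Monte Carlo: w <- w + alpha * (G_t - x(s)^T w) * x(s)
    return w + alpha * (G_t - x_s @ w) * x_s

def td_update(w, x_s, r, x_s_next, gamma, alpha):
    # TD(0): w <- w + alpha * (r + gamma * x(s')^T w - x(s)^T w) * x(s)
    td_target = r + gamma * (x_s_next @ w)
    return w + alpha * (td_target - x_s @ w) * x_s
```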
RL with Function Approximator
Linear value function approximators assume the value function is a weighted combination of a set of features, where each feature is a function of the state
Linear VFAs often work well given the right set of features
But they can require carefully hand-designing that feature set
An alternative is to use a much richer function approximation class that can work directly from states without requiring an explicit specification of features
Local representations, including kernel-based approaches, have some appealing properties (including convergence results in certain cases) but typically can't scale well to enormous spaces and datasets
Deep Neural Networks (DNN)
Composition of multiple functions
Can use the chain rule to backpropagate the gradient
Major innovation: tools to automatically compute gradients for a
DNN
Deep Neural Networks (DNN) Specification and Fitting
Generally combines both linear and non-linear transformations
Linear: z = Wx + b
Non-linear: an elementwise activation, e.g., h = g(z) with g a ReLU, sigmoid, or tanh
To fit the parameters, require a loss function (MSE, log likelihood, etc.)
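A minimal sketch of such a composition: a two-layer network (linear, ReLU, linear) fit with a squared-error loss (illustrative only):

```python
import numpy as np

def two_layer_net(x, W1, b1, W2, b2):
    # Linear transform -> non-linear activation (ReLU) -> linear transform
    h = np.maximum(0.0, W1 @ x + b1)
    return W2 @ h + b2

def mse_loss(y_pred, y_true):
    # Squared-error loss used to fit the parameters (W1, b1, W2, b2)
    return 0.5 * np.mean((y_pred - y_true) ** 2)
```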
The Benefit of Deep Neural Network Approximators
Alternative: Deep neural networks
Uses distributed representations instead of local representations
Universal function approximator
Can potentially need exponentially fewer nodes/parameters (compared to a shallow net) to represent the same function
Can learn the parameters using stochastic gradient descent
Table of Contents
1 Convolutional Neural Nets (CNNs)
2 Deep Q Learning
Why Do We Care About CNNs?
CNNs are extensively used in computer vision
If we want to go from pixels to decisions, it is likely useful to leverage insights from architectures designed for visual input
Fully Connected Neural Net
Images Have Structure
Have local structure and correlation
Have distinctive features in space & frequency domains
Convolutional NN
Exploit local structure and shared feature extraction
Not fully connected
Locality of processing
Weight sharing for parameter reduction
Learn the parameters of multiple convolutional filter banks
Compress to extract salient features & favor generalization
Locality of Information: Receptive Fields
(Filter) Stride
Slide the 5x5 mask over all the input pixels
Stride length = 1
Can use other stride lengths
Assume input is 28x28, how many neurons in 1st hidden layer?
Zero padding: how many 0s to add to either side of input layer
Shared Weights
What is the precise relationship between the neurons in the receptive field and the corresponding neuron in the hidden layer?
What is the activation value of the hidden layer neuron?
g(b + Σ_i w_i x_i)
Sum over i is only over the neurons in the receptive field of the
hidden layer neuron
The same weights w and bias b are used for each of the hidden
neurons
In this example, 24 × 24 hidden neurons
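A rough NumPy sketch of this computation for one feature map, using the 28×28 input and 5×5 shared filter from the example (the activation g and all names are illustrative):

```python
import numpy as np

def feature_map(image, w, b, g=lambda z: np.maximum(0.0, z)):
    # One shared filter (weights w, bias b) applied at every location:
    # the output is a 24x24 map of the same feature detected in different places.
    H, W = image.shape                       # 28, 28
    k = w.shape[0]                           # 5
    out = np.zeros((H - k + 1, W - k + 1))   # 24 x 24
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]  # local receptive field
            out[i, j] = g(b + np.sum(w * patch))
    return out
```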
Ex. Shared Weights, Restricted Field
Consider 28x28 input image
24x24 hidden layer
Receptive field is 5x5
Feature Map
All the neurons in the first hidden layer detect exactly the same
feature, just at different locations in the input image.
Feature: the kind of input pattern (e.g., a local edge) that makes the
neuron produce a certain response level
Why does this make sense?
Suppose the weights and bias are (learned) such that the hidden neuron can pick out a vertical edge in a particular local receptive field.
That ability is also likely to be useful at other places in the image.
Useful to apply the same feature detector everywhere in the image.
Yields translation (spatial) invariance (the feature can be detected in any part of the image)
Inspired by the visual system
Feature Map
The map from the input layer to the hidden layer is therefore a feature map: all nodes detect the same feature in different parts of the image
The map is defined by the shared weights and bias
The shared map is the result of applying a convolutional filter (defined by the weights and bias), also known as convolution with learned kernels
Convolutional Layer: Multiple Filters Ex.1
¹ http://cs231n.github.io/convolutional-networks/
Pooling Layers
Pooling layers are usually used immediately after convolutional layers.
Pooling layers simplify / subsample / compress the information in the output from the convolutional layer
A pooling layer takes each feature map output from the convolutional
layer and prepares a condensed feature map
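For instance, a one-function sketch of 2×2 max pooling, a common choice (illustrative):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Condense a feature map by taking the max over non-overlapping 2x2 blocks
    H, W = feature_map.shape
    trimmed = feature_map[:H - H % 2, :W - W % 2]
    return trimmed.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
```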
Final Layer Typically Fully Connected
Table of Contents
1 Convolutional Neural Nets (CNNs)
2 Deep Q Learning
Generalization
Using function approximation to help scale up to making decisions in
really large domains
Deep Reinforcement Learning
Use deep neural networks to represent
Value function
Policy
Model
Optimize loss function by stochastic gradient descent (SGD)
Deep Q-Networks (DQNs)
Represent state-action value function by Q-network with weights w
Q̂(s, a; w) ≈ Q(s, a)
Figure: a network with weights w maps s to V̂(s; w), or s and a to Q̂(s, a; w)
Recall: Action-Value Function Approximation with an
Oracle
Q̂^π(s, a; w) ≈ Q^π(s, a)
Minimize the mean-squared error between the true action-value function Q^π(s, a) and the approximate action-value function:
J(w) = E_π[(Q^π(s, a) − Q̂^π(s, a; w))^2]
Use stochastic gradient descent to find a local minimum
−(1/2) ∇_w J(w) = E_π[(Q^π(s, a) − Q̂^π(s, a; w)) ∇_w Q̂^π(s, a; w)]
Δw = −(1/2) α ∇_w J(w)
Stochastic gradient descent (SGD) samples the gradient
Recall: Incremental Model-Free Control Approaches
Similar to policy evaluation, the true state-action value function for a state is unknown, so we substitute a target value
In Monte Carlo methods, use a return G_t as a substitute target
Δw = α(G_t − Q̂(s_t, a_t; w)) ∇_w Q̂(s_t, a_t; w)
For SARSA instead use a TD target r + γ Q̂(s_{t+1}, a_{t+1}; w), which leverages the current function approximation value
Δw = α(r + γ Q̂(s_{t+1}, a_{t+1}; w) − Q̂(s_t, a_t; w)) ∇_w Q̂(s_t, a_t; w)
For Q-learning instead use a TD target r + γ max_a Q̂(s_{t+1}, a; w), which leverages the max of the current function approximation value
Δw = α(r + γ max_a Q̂(s_{t+1}, a; w) − Q̂(s_t, a_t; w)) ∇_w Q̂(s_t, a_t; w)
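As a sketch, the Q-learning update above with a linear approximator Q̂(s, a; w) = x(s, a)^T w (the feature function and names are illustrative, not part of the lecture; DQN replaces the linear form with a neural network):

```python
import numpy as np

def q_learning_step(w, x_sa, r, x_next_all, gamma, alpha):
    # x_sa: feature vector for the current (s, a)
    # x_next_all: one feature vector per action a' for the next state s'
    q_next_max = np.max(x_next_all @ w)            # max_a' Q̂(s', a'; w)
    td_error = r + gamma * q_next_max - x_sa @ w   # TD target minus current estimate
    return w + alpha * td_error * x_sa             # gradient of a linear Q̂ is x(s, a)
```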
Using these ideas to do Deep RL in Atari
DQNs in Atari
End-to-end learning of values Q(s, a) from pixels s
Input state s is stack of raw pixels from last 4 frames
Output is Q(s, a) for 18 joystick/button positions
Reward is change in score for that step
Network architecture and hyperparameters fixed across all games
Q-Learning with Value Function Approximation
Minimize MSE loss by stochastic gradient descent
Q-learning converges to the optimal Q*(s, a) when using a table lookup representation
But Q-learning with VFA can diverge
Two of the issues causing problems:
Correlations between samples
Non-stationary targets
Deep Q-learning (DQN) addresses both of these challenges by
Experience replay
Fixed Q-targets
DQNs: Experience Replay
To help remove correlations, store dataset (called a replay buffer) D
from prior experience
D = {(s_1, a_1, r_1, s_2), (s_2, a_2, r_2, s_3), (s_3, a_3, r_3, s_4), . . . , (s_t, a_t, r_t, s_{t+1})}, from which tuples (s, a, r, s') are sampled
To perform experience replay, repeat the following:
(s, a, r, s') ∼ D: sample an experience tuple from the dataset
Compute the target value for the sampled s: r + γ max_{a'} Q̂(s', a'; w)
Use stochastic gradient descent to update the network weights
Δw = α(r + γ max_{a'} Q̂(s', a'; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
The target can be treated as a fixed scalar for this update, but the weights will be updated on the next round, changing the target value
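A minimal replay-buffer sketch (class and method names are illustrative, not from the lecture):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s') tuples and sample them uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is evicted first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive tuples
        return random.sample(list(self.buffer), batch_size)
```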
DQNs: Fixed Q-Targets
To help improve stability, fix the target weights used in the target
calculation for multiple updates
Use a different set of weights to compute the target than the set being updated
Let parameters w⁻ be the set of weights used in the target, and w be the weights that are being updated
Slight change to computation of target value:
(s, a, r, s') ∼ D: sample an experience tuple from the dataset
Compute the target value for the sampled s: r + γ max_{a'} Q̂(s', a'; w⁻)
Use stochastic gradient descent to update the network weights
Δw = α(r + γ max_{a'} Q̂(s', a'; w⁻) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
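A rough sketch of one round of updates with fixed Q-targets (the callables q_values and grad_q stand in for the Q-network and its gradient; all names, and the sync_every refresh schedule, are illustrative assumptions):

```python
import numpy as np

def dqn_update(w, w_target, batch, q_values, grad_q, gamma, alpha, step, sync_every=1000):
    # q_values(s, weights) -> array of Q̂(s, a; weights) over actions
    # grad_q(s, a, weights) -> gradient of Q̂(s, a; weights) w.r.t. the weights
    for (s, a, r, s_next) in batch:
        target = r + gamma * np.max(q_values(s_next, w_target))  # uses frozen w⁻
        td_error = target - q_values(s, w)[a]                    # uses current w
        w = w + alpha * td_error * grad_q(s, a, w)
    if step % sync_every == 0:
        w_target = w.copy()   # periodically refresh the frozen target weights
    return w, w_target
```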
DQNs Summary
DQN uses experience replay and fixed Q-targets
Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample random mini-batch of transitions (s, a, r, s') from D
Compute Q-learning targets w.r.t. old, fixed parameters w⁻
Optimizes MSE between Q-network and Q-learning targets
Uses stochastic gradient descent
DQN
Figure: Human-level control through deep reinforcement learning, Mnih et al,
2015
Demo
DQN Results in Atari
Figure: Human-level control through deep reinforcement learning, Mnih et al,
2015
Which Aspects of DQN were Important for Success?
Game            Linear   Deep Network   DQN w/ fixed Q   DQN w/ replay   DQN w/ replay and fixed Q
Breakout             3              3               10             241                         317
Enduro              62             29              141             831                        1006
River Raid        2345           1453             2868            4102                        7447
Seaquest           656            275             1003             823                        2894
Space Invaders     301            302              373             826                        1089
Replay is hugely important
Why? Beyond helping with correlation between samples, what does
replaying do?
Deep RL
Success in Atari has led to huge excitement in using deep neural
networks to do value function approximation in RL
Some immediate improvements (many others!)
Double DQN (Deep Reinforcement Learning with Double Q-Learning,
Van Hasselt et al, AAAI 2016)
Prioritized Replay (Prioritized Experience Replay, Schaul et al, ICLR
2016)
Dueling DQN (best paper ICML 2016) (Dueling Network Architectures
for Deep Reinforcement Learning, Wang et al, ICML 2016)
Double DQN
Recall the maximization bias challenge
The max of the estimated state-action values can be a biased estimate of the true max
Double Q-learning
Recall: Double Q-Learning
1: Initialize Q1(s, a) and Q2(s, a) ∀s ∈ S, a ∈ A; t = 0, initial state s_t = s_0
2: loop
3:   Select a_t using ε-greedy π(s) = arg max_a [Q1(s_t, a) + Q2(s_t, a)]
4:   Observe (r_t, s_{t+1})
5:   if (with 0.5 probability) then
6:     Q1(s_t, a_t) ← Q1(s_t, a_t) + α(r_t + γ Q1(s_{t+1}, arg max_{a'} Q2(s_{t+1}, a')) − Q1(s_t, a_t))
7:   else
8:     Q2(s_t, a_t) ← Q2(s_t, a_t) + α(r_t + γ Q2(s_{t+1}, arg max_{a'} Q1(s_{t+1}, a')) − Q2(s_t, a_t))
9:   end if
10:  t = t + 1
11: end loop
Double DQN
Extend this idea to DQN
Current Q-network w is used to select actions
Older Q-network w⁻ is used to evaluate actions
Δw = α(r + γ Q̂(s', arg max_{a'} Q̂(s', a'; w); w⁻) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
Action selection uses w (the inner arg max); action evaluation uses w⁻ (the outer Q̂)
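A small sketch contrasting the vanilla DQN target with the Double DQN target (q_values(s, weights) is an assumed helper returning Q̂(s, ·; weights) as an array; names are illustrative):

```python
import numpy as np

def dqn_target(r, s_next, w_target, gamma, q_values):
    # Vanilla DQN: the target network both selects and evaluates the action
    return r + gamma * np.max(q_values(s_next, w_target))

def double_dqn_target(r, s_next, w, w_target, gamma, q_values):
    # Double DQN: select with the current weights w, evaluate with the older w⁻
    a_star = np.argmax(q_values(s_next, w))                 # action selection: w
    return r + gamma * q_values(s_next, w_target)[a_star]   # action evaluation: w⁻
```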
Double DQN
Figure: van Hasselt, Guez, Silver, 2015
Deep RL
Success in Atari has led to huge excitement in using deep neural
networks to do value function approximation in RL
Some immediate improvements (many others!)
Double DQN (Deep Reinforcement Learning with Double Q-Learning, Van Hasselt et al, AAAI 2016)
Prioritized Replay (Prioritized Experience Replay, Schaul et al, ICLR
2016)
Dueling DQN (best paper ICML 2016) (Dueling Network Architectures
for Deep Reinforcement Learning, Wang et al, ICML 2016)
Refresher: Mars Rover Model-Free Policy Evaluation
Figure: Mars rover MDP with states s1 . . . s7; R(s1) = +1, R(s7) = +10, and R(s) = 0 for all other states
Mars rover: R = [+1 0 0 0 0 0 +10] for any action
π(s) = a1 ∀s, γ = 1; any action from s1 or s7 terminates the episode
Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, +1, terminal)
First visit MC estimate of V of each state? [1 1 1 0 0 0 0]
Every visit MC estimate of V of s2? 1
TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
Now we get to choose 2 "replay" backups to do. Which should we pick to get the best estimate?
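A quick script to check these numbers (a sketch that hard-codes the trajectory above; γ = 1):

```python
# Trajectory: (s3, 0, s2), (s2, 0, s2), (s2, 0, s1), (s1, +1, terminal)
traj = [(3, 0.0, 2), (2, 0.0, 2), (2, 0.0, 1), (1, 1.0, None)]

# First-visit MC: return following the first visit to each state
returns_after = [sum(r for _, r, _ in traj[i:]) for i in range(len(traj))]
first_visit = {}
for i, (s, _, _) in enumerate(traj):
    first_visit.setdefault(s, returns_after[i])
print(first_visit)          # {3: 1.0, 2: 1.0, 1: 1.0}  ->  V = [1 1 1 0 0 0 0]

# TD(0) with alpha = 1, values initialized to 0
V = {s: 0.0 for s in range(1, 8)}
for s, r, s_next in traj:
    target = r + (V[s_next] if s_next is not None else 0.0)
    V[s] = V[s] + 1.0 * (target - V[s])
print(V)                    # only V[s1] = 1  ->  [1 0 0 0 0 0 0]
```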
Impact of Replay?
In tabular TD-learning, the order of replayed updates can help speed learning
Repeating some updates seems to propagate information better than repeating others
Are there systematic ways to prioritize updates?
Potential Impact of Ordering Episodic Replay Updates
Figure: Schaul, Quan, Antonoglou, Silver ICLR 2016
Oracle: picks the (s, a, r, s') tuple to replay that will minimize the global loss
Exponential improvement in convergence (in the number of updates needed to converge)
The oracle is not a practical method, but it illustrates the impact of ordering
Prioritized Experience Replay
Let i be the index of the i-th tuple of experience (s_i, a_i, r_i, s_{i+1})
Sample tuples for update using a priority function
Priority of a tuple i is proportional to the DQN error
p_i = |r + γ max_{a'} Q(s_{i+1}, a'; w⁻) − Q(s_i, a_i; w)|
Update p_i after every update
p_i for new tuples is set to the maximal priority
One method¹: proportional (stochastic prioritization)
P(i) = p_i^α / Σ_k p_k^α
¹ See paper for details and an alternative
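A sketch of proportional stochastic prioritization (illustrative; Schaul et al use a sum-tree so sampling stays efficient):

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, seed=None):
    # Sample tuple indices with probability P(i) = p_i^alpha / sum_k p_k^alpha
    rng = np.random.default_rng(seed)
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    # alpha = 0 gives uniform sampling (plain experience replay);
    # alpha = 1 samples proportionally to the magnitude of the TD error
    return rng.choice(len(probs), size=batch_size, p=probs)
```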
Check Your Understanding
α = 0 yields what rule for selecting among existing tuples?
Performance of Prioritized Replay vs Double DQN
Figure: Schaul, Quan, Antonoglou, Silver ICLR 2016
Deep RL
Success in Atari has led to huge excitement in using deep neural
networks to do value function approximation in RL
Some immediate improvements (many others!)
Double DQN (Deep Reinforcement Learning with Double Q-Learning, Van Hasselt et al, AAAI 2016)
Prioritized Replay (Prioritized Experience Replay, Schaul et al, ICLR
2016)
Dueling DQN (best paper ICML 2016) (Dueling Network
Architectures for Deep Reinforcement Learning, Wang et al, ICML
2016)
Value & Advantage Function
Intuition: the features needed to determine value may be different from those needed to determine the benefit of each action
E.g.
Game score may be relevant to predicting V(s)
But not necessarily for indicating relative action values
Advantage function (Baird 1993)
A^π(s, a) = Q^π(s, a) − V^π(s)
Dueling DQN
Identifiability
Advantage function
A^π(s, a) = Q^π(s, a) − V^π(s)
Identifiable?
Identifiability
Advantage function
A^π(s, a) = Q^π(s, a) − V^π(s)
Unidentifiable
Option 1: Force Â(s, a) = 0 if a is the action taken (the arg max action)
Q̂(s, a; w) = V̂(s; w) + (Â(s, a; w) − max_{a'∈A} Â(s, a'; w))
Option 2: Use the mean as a baseline (more stable)
Q̂(s, a; w) = V̂(s; w) + (Â(s, a; w) − (1/|A|) Σ_{a'} Â(s, a'; w))
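A sketch of the Option 2 aggregation, assuming the value and advantage heads have already produced their outputs (names are illustrative):

```python
import numpy as np

def dueling_q(value, advantages):
    # Combine V̂(s) (scalar) and Â(s, ·) (array over actions) into Q̂(s, ·)
    # Subtracting the mean advantage makes the decomposition identifiable
    return value + (advantages - advantages.mean())

# Example: V̂ = 2.0, Â = [0.5, -0.5, 0.0]  ->  Q̂ = [2.5, 1.5, 2.0]
print(dueling_q(2.0, np.array([0.5, -0.5, 0.0])))
```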
Dueling DQN vs. Double DQN with Prioritized Replay
Figure: Wang et al, ICML 2016
Practical Tips for DQN on Atari (from J. Schulman)
DQN is more reliable on some Atari tasks than others. Pong is a
reliable task: if it doesn’t achieve good scores, something is wrong
Large replay buffers improve robustness of DQN, and memory
efficiency is key
Use uint8 images, don’t duplicate data
Be patient. DQN converges slowly; for Atari it's often necessary to wait for 10-40M frames (a couple of hours to a day of training on a GPU) to see results significantly better than a random policy
In our Stanford class: debug your implementation on a small test environment
Practical Tips for DQN on Atari (from J. Schulman) cont.
Try Huber loss on the Bellman error
L(x) = x²/2 if |x| ≤ δ, and L(x) = δ|x| − δ²/2 otherwise
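A direct sketch of that loss (δ = 1 is a common default; illustrative):

```python
import numpy as np

def huber(x, delta=1.0):
    # Quadratic for small errors, linear for large ones (robust to outlier TD errors)
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * np.abs(x) - 0.5 * delta ** 2)
```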
Consider trying Double DQN: a significant improvement from a small code change in TensorFlow
To test out your data pre-processing, try your own skills at navigating the environment based on the processed frames
Always run at least two different seeds when experimenting
Learning rate scheduling is beneficial. Try high learning rates in the initial exploration period
Try non-standard exploration schedules