
Reinforcement Learning

Reinforcement Learning: An Introduction
• Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2nd Edition. The book is available for free; references refer to the final PDF version.
  http://incompleteideas.net/book/the-book-2nd.html
• http://incompleteideas.net/book/RLbook2020.pdf
http://angelpawstherapy.org/positive-reinforcement-dog-training.html
Nature of Learning
• We learn from past experiences.
  – When an infant plays, waves its arms, or looks about, it has no explicit teacher.
  – But it does have direct interaction with its environment.
• Years of positive compliments as well as negative criticism have helped shape who we are today.
• Reinforcement learning: a computational approach to learning from interaction.

Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction
Nishant Shukla, Machine Learning with TensorFlow
Reinforcement Learning

https://www.cs.utexas.edu/~eladlieb/RLRG.html
Machine Learning, Tom Mitchell, 1997

http://www.cs.cmu.edu/~tom/mlbook.html
Frozen Lake

S F F F
F H F H
F F F H
H F F G

(S = start, F = frozen surface, H = hole, G = goal)
Frozen Lake World (OpenAI Gym)

S F F F
F H F H
F F F H
H F F G

Agent → Environment: (1) action (right, left, up, down)
Environment → Agent: (2) state, reward
Frozen Lake World (OpenAI Gym)

S F F F
F H F H
F F F H
H F F G

Agent → Environment: (1) action: RIGHT
Environment → Agent: (2) state: 1, reward: 0

https://gym.openai.com/
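A minimal sketch of this interaction loop with the OpenAI Gym API (the environment id "FrozenLake-v1", the is_slippery flag, and the exact reset/step return signature depend on the installed gym version, so treat those details as assumptions):

import gym

env = gym.make("FrozenLake-v1", is_slippery=False)
state = env.reset()                      # start at S (state 0)
action = 2                               # in FrozenLake, action 2 usually means RIGHT
next_state, reward, done, info = env.step(action)
print(next_state, reward, done)          # e.g. 1, 0.0, False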
NEXT: Try the Frozen Lake Game for Real?

Frozen Lake: Even if you know the way, ask.
Q-function (state-action value function)

Inputs: (1) state, (2) action
Output: (3) quality (expected reward)

Q(state, action)
Policy using Q-function

Q(state, action)

Q(s1, LEFT): 0
Q(s1, RIGHT): 0.5
Q(s1, UP): 0
Q(s1, DOWN): 0.3
Optimal Policy, and Max Q

Q(state, action)

Max Q = the maximum Q value over all actions in a state; the optimal policy picks the action that attains it.
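In symbols (my reconstruction of the slide's point, not text copied from the slide):

\pi^*(s) = \arg\max_a Q(s, a)
\max Q(s) = \max_a Q(s, a)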
Frozen Lake: optimal policy with Q

[figure: Frozen Lake grid with arrows marking the Q-greedy path from S to G]
Finding, Learning Q

• Assume (believe) that Q in s' exists!

• My situation:
  - I am in state s.
  - When I do action a, I will go to state s'.
  - When I do action a, I will get reward r.
  - Q in s', i.e. Q(s', a'), exists!
• How can we express Q(s, a) using Q(s', a')?
State, Action, Reward

S F F F
F H F H
F F F H
H F F G

Future Reward

Learning Q(s, a)?
Learning Q(s, a): a 16x4 Table
16 states and 4 actions (up, down, left, right)

Learning Q(s, a): Table
Initial Q values are all 0.

[table: every cell of the 4x4 grid annotated with four action values, all initialized to 0]
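A minimal sketch of this table in code (assuming NumPy and the 16-state, 4-action Frozen Lake layout above):

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros([n_states, n_actions])   # Q[s, a]; every entry starts at 0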
Learning Q(s, a) Table (with many trials)
Initial Q values are all 0.

Learning Q(s, a) Table: one success!
[figure: after one successful episode, the entries along the path that reached the goal become 1; all other entries stay 0]

Learning Q(s, a) Table: optimal policy
[figure: following the entries with value 1 from S traces a path to the goal G]
Dummy Q-learning algorithm
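The slide's algorithm box did not survive extraction; below is a hedged sketch of the "dummy" update it describes (no discount, no learning rate), in the spirit of Mitchell's Q-learning for deterministic worlds. The gym environment id and the reset/step return values are assumptions:

import gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)   # assumed env id; deterministic variant
Q = np.zeros([env.observation_space.n, env.action_space.n])

for episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        a = np.argmax(Q[s, :])                    # pick an action (exploration is added later)
        s_next, r, done, _ = env.step(a)
        Q[s, a] = r + np.max(Q[s_next, :])        # "dummy" update: overwrite with r + max Q(s', a')
        s = s_next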
Exploit vs. Exploration

[figure: the Q-table after one success; always exploiting the known path of 1s means other, possibly better, paths are never tried]

http://home.deib.polimi.it/restelli/MyWebSite/pdf/rl5.pdf
Exploit vs. Exploration: E-greedy
Epsilon-Greedy

e = 0.1
if np.random.rand() < e:
    a = env.action_space.sample()    # explore: pick a random action
else:
    a = np.argmax(Q[s, :])           # exploit: pick the best known action
Exploit vs. Exploration: decaying E-greedy

for i in range(1000):
    e = 0.1 / (i + 1)                    # epsilon shrinks as training proceeds
    if np.random.rand() < e:
        a = env.action_space.sample()    # explore less and less over time
    else:
        a = np.argmax(Q[s, :])           # exploit the current Q-table
Exploit vs. Exploration: add random noise

a = np.argmax(Q[s, :] + np.random.randn(n_actions))

Example: a = argmax([0.5, 0.6, 0.3, 0.2, 0.5] + [0.1, 0.2, 0.7, 0.3, 0.1])
           = argmax([0.6, 0.8, 1.0, 0.5, 0.6]) = 2
(without the noise, argmax of [0.5, 0.6, 0.3, 0.2, 0.5] would have been 1)
Exploit vs. Exploration: add decaying random noise

for i in range(1000):
    a = np.argmax(Q[s, :] + np.random.randn(n_actions) / (i + 1))   # noise fades out over time
Discounted Future Reward

[figure: Frozen Lake Q-table in which two different paths to the goal both carry value 1; without discounting, a longer path looks just as good as a shorter one]
Learning Q(s, a) with discounted reward

Discounted future reward: with a discount factor γ (0 < γ < 1), a reward that arrives k steps in the future is worth γ^k times an immediate reward.

Learning Q(s, a) with discounted reward:
Q(s, a) ← r + γ max_{a'} Q(s', a')

Q-learning algorithm
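The full algorithm box is an image in the original slide. Here is a hedged sketch of tabular Q-learning on Frozen Lake that combines the discounted update with decaying-noise exploration; the gym API details (env id, reset/step signatures) are assumptions:

import gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)   # deterministic world
Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.99

for i in range(2000):
    s = env.reset()
    done = False
    while not done:
        # exploration: add decaying random noise to the Q row before taking argmax
        a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (i + 1))
        s_next, r, done, _ = env.step(a)
        Q[s, a] = r + gamma * np.max(Q[s_next, :])    # discounted update
        s = s_next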
Convergence

• In deterministic worlds
• In finite state spaces
Machine Learning, Tom Mitchell, 1997
Windy Frozen Lake
[figure: the Frozen Lake grid, now windy (slippery): the agent does not always move in the direction it chooses]
Deterministic vs. Stochastic (nondeterministic)

• In deterministic models, the output is fully determined by the parameter values and the initial conditions.
• Stochastic models possess some inherent randomness.
  - The same set of parameter values and initial conditions can lead to an ensemble of different outputs.
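In Frozen Lake terms, the difference is just a flag (a sketch; the is_slippery keyword exists in gym's FrozenLake, but the exact env id depends on your gym version):

import gym

deterministic_env = gym.make("FrozenLake-v1", is_slippery=False)  # same action, same outcome
stochastic_env    = gym.make("FrozenLake-v1", is_slippery=True)   # slippery ice: the agent may
                                                                   # slide in a different direction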
Stochastic (non-deterministic)
Our previous Q-learning does not work.

Score over time: 0.0165

Why does it not work in stochastic (non-deterministic) worlds?
Stochastic (non-deterministic) world

• Solution?
  - Listen to Q(s') (just a little bit)
  - Update Q(s) a little bit at a time (learning rate)

• Like our life mentors:
  - Don't just listen to and follow one mentor
  - Listen to many mentors
Learning incrementally

Old update (take the target at face value):
Q(s, a) ← r + γ max_{a'} Q(s', a')

• Learning rate α
  - e.g. α = 0.1

New update (move only part of the way toward the target):
Q(s, a) ← (1 - α) Q(s, a) + α [r + γ max_{a'} Q(s', a')]

Learning with learning rate

Q(s, a) ← r + γ max_{a'} Q(s', a')

Q(s, a) ← (1 - α) Q(s, a) + α [r + γ max_{a'} Q(s', a')]

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
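In code, the learning-rate update is one line (a sketch; alpha, gamma, and the Q table, state, and transition variables are as defined in the earlier snippets):

alpha, gamma = 0.1, 0.99
# blend the old estimate with the new target instead of overwriting it
Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next, :]))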
Q-learning algorithm

Convergence

Q̂(s, a) ← (1 - α) Q̂(s, a) + α [r + γ max_{a'} Q̂(s', a')]
Machine Learning, Tom Mitchell, 1997


Q-Table (?)

Can a table still work for a 100x100 maze, or for an 80x80-pixel screen with 2 colors (black/white)? An 80x80 binary image alone has 2^6400 possible states, far too many for a table.


Q-function Approximation

Option 1: the network takes (1) state s and (2) action a as input, and outputs (3) the quality (reward) for that given action (e.g., LEFT: 0.5).

Q-function Approximation

Option 2: the network takes only (1) state s as input, and outputs (2) the quality (reward) for all actions at once (e.g., [0.5, 0.1, 0.0, 0.8] = LEFT: 0.5, RIGHT: 0.1, UP: 0.0, DOWN: 0.8).

Q-function Approximation
Q-Network training (linear regression)

H(x) = Wx
cost(W) = (1/m) Σ_{i=1}^{m} (Wx^(i) - y^(i))^2

[diagram: (1) input state s → (2) output Ws, which approximates Q(s)]
Q-Network training (linear regression)

cost(W) = (Ws - y)^2
y = r + γ max_{a'} Q(s')

[diagram: (1) input state s → (2) output Ws ≈ Q(s)]
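A minimal sketch of such a linear Q-network for Frozen Lake, trained by gradient descent on (Ws - y)^2 with a one-hot state encoding. All names here are my own; the original lecture uses TensorFlow, but plain NumPy keeps the idea visible:

import numpy as np

n_states, n_actions = 16, 4
W = np.random.uniform(0, 0.01, size=(n_states, n_actions))  # linear "network": Q(s) = one_hot(s) @ W
gamma, lr = 0.99, 0.1

def one_hot(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def train_step(s, a, r, s_next, done):
    q_pred = one_hot(s) @ W                           # predicted Q values for state s, shape (n_actions,)
    y = r if done else r + gamma * np.max(one_hot(s_next) @ W)
    error = q_pred[a] - y                             # only the taken action contributes to the loss
    W[s, a] -= lr * error                             # gradient step on (Ws - y)^2 (constant factor folded into lr)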
Q-Network training (math notations)

• Approximate the Q* function using parameters θ:
  Q̂(s, a|θ) ≈ Q*(s, a)

• Choose θ to minimize:
  min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ))]^2
http://introtodeeplearning.com/6.S091DeepReinforcementLearning.pdf
Algorithm
Convergence
Deep Reinforcement Learning
Reinforcement + Neural Net
DQN: Deep, Replay, Separated Networks
• Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
• DeepMind Technologies

https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Q-Nets are unstable
Two big issues

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

1. Correlations between samples

Playing Atari with Deep Reinforcement Learning, V. Mnih et al.

1. Correlations between samples
2. Non-stationary targets

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ))]^2

prediction: Ŷ = Q̂(s_t, a_t|θ)        target: Y = r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ)

The prediction and the target share the same θ, so every update that improves the prediction also moves the target.
DQN’s solutions
1. Go deep
2. Capture and replay (correlations between samples)
3. Separate networks: create a target network (non-stationary targets)
1. Go Deep

Problem 2: correlations between samples
Solution 2: experience replay

Capture: store each transition (s_1, a_1, r_2, s_2), (s_2, a_2, r_3, s_3), (s_3, a_3, r_4, s_4), ..., (s_t, a_t, r_{t+1}, s_{t+1}) in a replay buffer.
Replay: sample random minibatches from the buffer and minimize
min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ))]^2
ICML 2016 Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind
Solution 2: experience replay

Playing Atari with Deep Reinforcement Learning, V. Mnih et al.
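A hedged sketch of such a replay buffer (class and variable names are my own, not from the paper):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)          # old transitions fall off the far end

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # random sampling breaks the correlation between consecutive transitions
        return random.sample(list(self.buffer), batch_size)

# usage: buffer.add(s, a, r, s_next, done) each step; once the buffer is large enough,
# train on buffer.sample(32) instead of on the latest transition alone.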
Problem 2: correlations between samples

(s_1, a_1, r_2, s_2)
(s_2, a_2, r_3, s_3)
(s_3, a_3, r_4, s_4)
...
(s_t, a_t, r_{t+1}, s_{t+1})
Problem 3: Non-stationary targets

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ))]^2

prediction: Ŷ = Q̂(s_t, a_t|θ)        target: Y = r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ)
Solution 3: separate target network

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ̄))]^2

[diagram: the prediction network (weights θ) computes Q̂(s); a separate target network (frozen weights θ̄) computes the target Y]
Solution 3: copy network
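A minimal sketch of how the frozen copy is used and refreshed (the names, the tabular weights, and the update interval are illustrative assumptions, not values from the paper):

import numpy as np

theta        = np.random.uniform(0, 0.01, size=(16, 4))   # online network weights (θ)
theta_target = theta.copy()                                # frozen target network weights (θ̄)
gamma, copy_every = 0.99, 100

def target_value(r, s_next, done):
    # targets are computed with θ̄, so they stay fixed while θ is being updated
    return r if done else r + gamma * np.max(theta_target[s_next, :])

for step in range(10000):
    # ... train theta on minibatches of (s, a, r, s_next, done) using target_value(...) ...
    if step % copy_every == 0:
        theta_target = theta.copy()    # periodically copy θ into θ̄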
Understanding the Nature Paper (2015)
DQN’s three solutions

1. Go deep
2. Capture and replay
   • Correlations between samples
3. Separate networks
   • Non-stationary targets

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind


2. Train from replay memory

https://github.com/awjuliani/DeepRL-Agents
DQN’s three solutions

1. Go deep
2. Capture and replay
   • Correlations between samples
3. Separate networks
   • Non-stationary targets

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind


Implementing the Nature Paper

Human-level control through deep reinforcement learning, Nature
http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
Solution 3: separate target network

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ̄))]^2

[diagram: the prediction network (θ) computes Q̂(s); the target network (θ̄) computes the target Y]

DQN vs. target DQN

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ̄))]^2
Solution 3: copy network

[diagram: the weights θ of the prediction network are periodically copied into the target network weights θ̄]
DQN works reasonably well
DQN: 2013 vs. 2015

• DQN implementations
  - https://github.com/songrotek/DQN-Atari-Tensorflow
  - https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py
  - https://github.com/devsisters/DQN-tensorflow
  - http://www.ehrenbrav.com/2016/08/teaching-your-computer-to-play-super-mario-bros-a-fork-of-the-google-deepmind-atari-machine-learning-project
Deep Reinforcement Learning
https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark

Never Give Up (NGU)
From R2D2 to Never Give Up

The LSTM is replaced with a Transformer.


Agent57
• Improves on the NGU agent
• A meta-controller allows each actor of the agent to choose a different trade-off between short-term and long-term performance

https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
Papers
• https://deepmind.com/research/dqn/
• https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
• https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
• http://nn.cs.utexas.edu/downloads/papers/stanley.gecco02_1.pdf
References
• Simple Reinforcement Learning with TensorFlow, https://medium.com/emergent-future/
• http://kvfrans.com/simple-algoritms-for-solving-cartpole/ (written by a high school student)
• Deep Reinforcement Learning: Pong from Pixels, Andrej Karpathy blog, http://karpathy.github.io/2016/05/31/rl/
• Machine Learning, Tom Mitchell, 1997
• Deep RL Bootcamp, https://sites.google.com/view/deep-rl-bootcamp/lectures
• Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition. The book is available for free; references refer to the final PDF version.
• Some other additional references that may be useful:
  - Reinforcement Learning: State-of-the-Art, Marco Wiering and Martijn van Otterlo, Eds.
  - Artificial Intelligence: A Modern Approach, Stuart J. Russell and Peter Norvig
  - Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  - David Silver's course on Reinforcement Learning
