13-RL DRL
Reinforcement Learning: An Introduction
• Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2nd Edition. This is available for free; references refer to the final PDF version.
• http://incompleteideas.net/book/the-book-2nd.html
• http://incompleteideas.net/book/RLbook2020.pdf
http://angelpawstherapy.org/positive-reinforcement-dog-training.html
Nature of Learning
• We learn from past experiences.
– When an infant plays, waves its arms, or looks about, it has no explicit teacher
– But it does have direct interaction with its environment.
• Years of positive compliments as well as negative
criticism have all helped shape who we are today.
• Reinforcement learning: computational approach
to learning from interaction.
Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction
Nishant Shukla, Machine Learning with TensorFlow
Reinforcement Learning, https://www.cs.utexas.edu/~eladlieb/RLRG.html
Machine Learning, Tom Mitchell, 1997, http://www.cs.cmu.edu/~tom/mlbook.html
Frozen Lake
S F F F
F H F H
F F F H
H F F G
(S = start, F = frozen surface, H = hole, G = goal)
Frozen Lake World (OpenAI Gym)
S F F F
F H F H
F F F H
H F F G
(1) Action (right, left, up, down)
Agent ↔ Environment
Frozen Lake World (OpenAI Gym)
S F F F
F H F H
F F F H
H F F G
(1) action: RIGHT
(2) state: 1, reward: 0
Agent ↔ Environment
https://gym.openai.com/
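A minimal sketch of this agent-environment loop in code, assuming the classic Gym API where reset() returns just the state and step() returns four values (newer Gymnasium releases return extra items, and older Gym releases name the environment FrozenLake-v0):

import gym  # newer installs ship this as gymnasium, with a slightly different API

# is_slippery=False gives the deterministic 4x4 lake
env = gym.make("FrozenLake-v1", is_slippery=False)
state = env.reset()          # agent starts at S (state 0)

done = False
while not done:
    action = env.action_space.sample()             # (1) agent sends an action
    state, reward, done, info = env.step(action)   # (2) environment returns next state and reward
    print("state:", state, "reward:", reward)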
NEXT: Try the real Frozen Lake game?
Frozen Lake: Even if you know the way, ask.
Q-function (state-action value function)
Q(state, action): takes (1) a state and (2) an action, returns (3) the quality (expected reward)
Policy using Q-function
Q(state, action)
Q(s1, LEFT): 0
Q(s1, RIGHT): 0.5
Q(s1, UP): 0
Q(s1, DOWN): 0.3
Optimal Policy and Max Q
π*(s) = argmax_a Q(s, a)
Max Q = max_a Q(s, a)
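A tiny sketch using the Q values from the slide above (the action ordering LEFT, RIGHT, UP, DOWN is assumed for illustration): the optimal policy simply takes the argmax, and Max Q is the value of that best action.

import numpy as np

actions = ["LEFT", "RIGHT", "UP", "DOWN"]     # assumed ordering
q_s1 = np.array([0.0, 0.5, 0.0, 0.3])         # Q(s1, a) from the example
best_action = actions[int(np.argmax(q_s1))]   # pi*(s1) = argmax_a Q(s1, a)
max_q = q_s1.max()                            # Max Q = max_a Q(s1, a)
print(best_action, max_q)                     # RIGHT 0.5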
Frozen Lake: optimal policy with Q
Finding, Learning Q
S F F F
F H F H
F F F H
H F F G
Future reward
Learning Q(s, a)?
Learning Q(s, a): 16×4 Table
16 states and 4 actions (up, down, left, right)
Learning Q(s, a): Table
Initial Q values are 0
[table: 4×4 grid of states, each cell holding four Q values (one per action), all 0]
Learning Q(s, a) Table (with many trials)
Initial Q values are 0
Learning Q(s, a) Table: one success!
Initial Q values are 0
[table: after the first successful episode, Q values of 1 appear along the path that reached the goal]
Learning Q(s, a) Table: optimal policy
[table: following the cells with Q = 1 from S traces a path to the goal G]
Dummy Q-learning algorithm
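The algorithm box from this slide did not survive extraction; below is a minimal sketch of the dummy version, assuming the Gym FrozenLake environment used earlier: act greedily with random tie-breaking, and simply overwrite Q(s, a) with r + max_a' Q(s', a'), with no discount and no learning rate yet.

import numpy as np
import gym

def rargmax(values):
    # argmax with random tie-breaking, so the agent still wanders while all Q are 0
    best = np.flatnonzero(values == values.max())
    return np.random.choice(best)

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros([env.observation_space.n, env.action_space.n])   # the 16x4 table, all zeros

for episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        a = rargmax(Q[s, :])
        s_next, reward, done, _ = env.step(a)
        Q[s, a] = reward + np.max(Q[s_next, :])   # dummy update: no discount, no learning rate
        s = s_next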
Exploit vs. Exploration
[table: Q values of 1 mark only the single path found so far; pure exploitation never tries the other routes]
http://home.deib.polimi.it/restelli/MyWebSite/pdf/rl5.pdf
Exploit vs. Exploration: E-greedy (epsilon-greedy)
e = 0.1
if np.random.rand() < e:
    a = env.action_space.sample()   # explore: try a random action
else:
    a = np.argmax(Q[s, :])          # exploit: take the best known action
Exploit vs. Exploration: decaying E-greedy
e = 1.0 / (i + 1)                   # epsilon shrinks as episode i grows
if np.random.rand() < e:
    a = env.action_space.sample()
else:
    a = np.argmax(Q[s, :])
Exploit vs. Exploration: add random noise
a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n))   # noise can flip the greedy choice
Learning Q(s, a) with discounted reward
Discounted future reward: R = r₀ + γ r₁ + γ² r₂ + ... + γⁿ rₙ
Update: Q(s, a) ← r + γ max_a' Q(s', a')
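As a toy illustration (values assumed, γ = 0.9): a reward of 1 that arrives three steps in the future is worth only 0.9³ ≈ 0.73 now, so shorter paths to the goal earn larger Q values.

gamma = 0.9
rewards = [0, 0, 0, 1]    # reward arrives only when the goal is reached, 3 steps ahead
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 0.9**3 = 0.729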
Q-learning algorithm
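A minimal sketch of the tabular algorithm with the discount added (epsilon-greedy exploration and the specific constants are illustrative choices, not the slide's original code); compared with the dummy sketch above, only the action selection and the γ factor change.

import numpy as np
import gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.9

for i in range(2000):
    s = env.reset()
    done = False
    e = 1.0 / ((i // 100) + 1)                      # decaying epsilon
    while not done:
        if np.random.rand() < e:
            a = env.action_space.sample()           # explore
        else:
            a = np.argmax(Q[s, :])                  # exploit
        s_next, r, done, _ = env.step(a)
        Q[s, a] = r + gamma * np.max(Q[s_next, :])  # discounted Q-learning update
        s = s_next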
Convergence
• In deterministic worlds
• In finite states
Machine Learning, Tom Mitchell, 1997
Windy Frozen Lake
Deterministic vs. Stochastic (non-deterministic)
Stochastic (non-deterministic) world
• Solution?
– Listen to Q(s') just a little bit
– Update Q(s, a) only a little bit at a time (learning rate)
• Learning rate α = 0.1
• Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a' Q(s', a') ]
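A sketch of just the changed update as a helper function (the function name and default values are mine, not from the slide): instead of overwriting Q(s, a), the noisy target is blended in with learning rate α.

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # soft update for a stochastic world: move Q(s, a) a little toward the target
    target = r + gamma * np.max(Q[s_next, :])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target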
Convergence
Q-function Approximation
[diagram A: inputs (1) state, s and (2) action, a → one Q value]
[diagram B: input (1) state, s → network → (2) quality (reward) for all actions,
e.g., [0.5, 0.1, 0.0, 0.8] = LEFT: 0.5, RIGHT: 0.1, UP: 0.0, DOWN: 0.8]
Q-function Approximation
Q-Network training (linear regression)
H(x) = Wx
cost(W) = (1/m) Σ_{i=1}^{m} (W x⁽ⁱ⁾ − y⁽ⁱ⁾)²
[network: (1) state s → (2) Ws → Q(s)]
Q-Network training (linear regression)
cost(W) = (Ws − y)²
y = r + γ max_a' Q(s')
[network: (1) state s → (2) Ws → Q(s)]
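A minimal NumPy sketch of one training step for the linear model H(x) = Wx under the squared cost above (the one-hot state encoding, learning rate, and γ are illustrative assumptions, not the slide's code):

import numpy as np

n_states, n_actions = 16, 4
W = np.random.uniform(0, 0.01, size=(n_actions, n_states))   # the linear "network"
lr, gamma = 0.1, 0.99

def one_hot(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def train_step(s, a, r, s_next, done):
    x = one_hot(s)
    q_pred = W @ x                                            # prediction Ws = Q(s, .)
    # target y = r + gamma * max_a' Q(s', a'); just r at a terminal state
    y = r if done else r + gamma * np.max(W @ one_hot(s_next))
    error = q_pred[a] - y                                     # cost(W) = error**2
    W[a] -= lr * error * x                                    # gradient step (factor 2 folded into lr)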
Q-Network training (math notations)
• Approximate the Q* function using θ
Q̂(s, a | θ) ≈ Q*(s, a)
[network: (1) state s → (2) Ws → Q(s)]
http://introtodeeplearning.com/6.S091DeepReinforcementLearning.pdf
Algorithm
Convergence
Deep Reinforcement Learning
Reinforcement + Neural Net
DQN: Deep, Replay, Separated networks
• Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
• DeepMind Technologies
• https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Q-Nets are unstable
Two big issues
Playing Atari with Deep Reinforcement Learning, V. Mnih et al. (DeepMind Technologies)
1. Correlations between samples
2. Non-stationary targets
min_θ Σ_{t=0}^{T} [ Q̂(s_t, a_t | θ) − ( r_t + γ max_{a'} Q̂(s_{t+1}, a' | θ) ) ]²
ICML 2016 Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind
Problem 1: correlations between samples
min_θ Σ_{t=0}^{T} [ Q̂(s_t, a_t | θ) − ( r_t + γ max_{a'} Q̂(s_{t+1}, a' | θ) ) ]²
Solution: experience replay
Playing Atari with Deep Reinforcement Learning, V. Mnih et al. (DeepMind Technologies)
https://github.com/awjuliani/DeepRL-Agents
DQN's three solutions
1. Go deep
2. Capture and replay (fixes correlations between samples; see the sketch below)
3. Separate networks (fixes non-stationary targets)
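A minimal sketch of solutions 2 and 3 (the buffer size, batch size, and the predict/train/weight-copy methods are placeholder names, not DeepMind's code): transitions are stored and replayed in random minibatches to break correlations, and targets are computed from a separate, periodically-copied network so they stay fixed between updates.

import random
from collections import deque

import numpy as np

replay_buffer = deque(maxlen=50_000)       # 2. capture and replay
gamma, batch_size = 0.99, 64

def remember(s, a, r, s_next, done):
    replay_buffer.append((s, a, r, s_next, done))

def train(main_net, target_net):
    # one DQN-style update from a random minibatch and a frozen target network
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(list(replay_buffer), batch_size)   # random sampling breaks correlations
    for s, a, r, s_next, done in batch:
        q = main_net.predict(s)                               # placeholder predict()
        # 3. the target comes from the separate target network (non-stationary targets fix)
        q[a] = r if done else r + gamma * np.max(target_net.predict(s_next))
        main_net.train(s, q)                                  # placeholder train()
    # elsewhere, every N steps: copy main_net's weights into target_net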
https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
Papers
• https://deepmind.com/research/dqn/
• https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
• https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
• http://nn.cs.utexas.edu/downloads/papers/stanley.gecco02_1.pdf
References
• Simple Reinforcement Learning with TensorFlow, https://medium.com/emergent-future/
• http://kvfrans.com/simple-algoritms-for-solving-cartpole/ (written by a high school student)
• Deep Reinforcement Learning: Pong from Pixels, Andrej Karpathy blog, http://karpathy.github.io/2016/05/31/rl/
• Machine Learning, Tom Mitchell, 1997
• Deep RL Bootcamp, https://sites.google.com/view/deep-rl-bootcamp/lectures
• Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition. Available for free; references refer to the final PDF version.
• Some other additional references that may be useful:
• Reinforcement Learning: State-of-the-Art, Marco Wiering and Martijn van Otterlo, Eds. [link]
• Artificial Intelligence: A Modern Approach, Stuart J. Russell and Peter Norvig. [link]
• Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville. [link]
• David Silver's course on Reinforcement Learning. [link]