Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode
Reinforcement Learning
Introduction
What is Reinforcement Learning?
Autonomous Helicopter
GPS
Accelerometers
Compass
Computer
How to fly it?
Autonomous Helicopter
[Thanks to Pieter Abbeel, Adam Coates and Morgan Quigley] For more videos: http://heli.stanford.edu.
Reinforcement Learning
state s: position of helicopter   →   action a: how to move the control sticks
reward function:
• positive reward: helicopter flying well
• negative reward: helicopter flying poorly
Robotic Dog Example
[Thanks to Zico Kolter]
Applications
• Controlling robots
• Factory optimization
• Financial (stock) trading
• Playing games (including video games)
Reinforcement Learning formalism
Mars rover example
Mars Rover Example
[Figure: a row of six states; states 1 and 6 are terminal states. From any other state the rover can move left or right.]
[Credit: Jagriti Agrawal, Emma Brunskill]
Reinforcement Learning formalism
The Return in reinforcement learning
Return
rewards:  100    0    0    0    0    40
state:      1    2    3    4    5    6
Return = R₁ + γR₂ + γ²R₃ + ⋯   (until the terminal state)
Discount factor γ
Example of Return
γ = 0.5
rewards:  100    0    0    0    0    40
state:      1    2    3    4    5    6
[Figure: the return from each state for different choices of actions, e.g. always going left vs. always going right.]
The return depends on the actions you take.
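A short sketch (ours, not from the slides) that reproduces these numbers: the six-state rover with rewards 100, 0, 0, 0, 0, 40, terminal states 1 and 6, and γ = 0.5.

```python
# Sketch (illustrative): returns for the 6-state Mars rover, gamma = 0.5.
REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}

def discounted_return(path, gamma=0.5):
    # Return = R1 + gamma*R2 + gamma^2*R3 + ... over the states visited
    return sum((gamma ** t) * REWARDS[s] for t, s in enumerate(path))

def rollout(start, action, gamma=0.5):
    # States visited when repeatedly taking `action` until reaching a terminal state
    path, s = [start], start
    while s not in TERMINAL:
        s = s - 1 if action == "left" else s + 1
        path.append(s)
    return discounted_return(path, gamma)

print([rollout(s, "left") for s in range(1, 7)])   # states 1..6: 100, 50, 25, 12.5, 6.25, 40
print([rollout(s, "right") for s in range(1, 7)])  # states 1..6: 100, 2.5, 5, 10, 20, 40
```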
Reinforcement Learning formalism
Making decisions: Policies in reinforcement learning
Policy
[Figure: several possible policies for the six-state rover (rewards 100 and 40 at the ends), each shown as arrows from states to actions.]
A policy is a function π(s) = a mapping from states to actions, that tells you
what action a to take in a given state s.
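As a concrete illustration (names ours), a policy for the six-state rover can be written as a lookup table; this one goes left from states 2–4 and right from state 5, matching the returns 100, 50, 25, 12.5, 20, 40 shown later.

```python
# One possible policy pi(s) = a for the 6-state Mars rover (states 1 and 6 are terminal).
PI = {2: "left", 3: "left", 4: "left", 5: "right"}

def policy(state):
    return PI[state]
```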
The goal of reinforcement learning
Find a policy π that tells you what action (a = π(s)) to take in every state (s) so as to
maximize the return.
Reinforcement Learning formalism
Review of key concepts
                   Mars rover              Helicopter                     Chess
states             6 states                position of helicopter         pieces on board
actions            left, right             how to move control sticks     possible moves
rewards
discount factor γ
return             R₁ + γR₂ + γ²R₃ + ⋯    R₁ + γR₂ + γ²R₃ + ⋯           R₁ + γR₂ + γ²R₃ + ⋯
policy             find π(s) = a           find π(s) = a                  find π(s) = a
Markov Decision Process (MDP)
[Diagram: the agent picks an action a; the environment / world responds with a reward R and the next state s back to the agent.]
State-action value function
State-action value function definition
State-action value function (Q-function)
Q(s, a) = Return if you
• start in state s.
• take action a (once).
• then behave optimally after that.
[Figure: six-state Mars rover, rewards 100, 0, 0, 0, 0, 40; for the actions shown (go left from states 2–4, go right from state 5), the return from each state is 100, 50, 25, 12.5, 20, 40.]
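A sketch (ours) that computes these Q values for the Mars rover by repeatedly applying the Bellman equation introduced later; it reproduces the numbers on the next slide.

```python
# Sketch (illustrative): Q(s, a) for the 6-state rover via repeated Bellman updates,
# Q(s, a) = R(s) + gamma * max_a' Q(s', a'), with gamma = 0.5.
REWARDS = [100, 0, 0, 0, 0, 40]              # rewards for states 1..6
GAMMA, ACTIONS = 0.5, ("left", "right")

Q = {(s, a): 0.0 for s in range(1, 7) for a in ACTIONS}
for _ in range(20):                          # iterate until the values settle
    for s in range(2, 6):                    # states 1 and 6 are terminal
        for a in ACTIONS:
            s_next = s - 1 if a == "left" else s + 1
            best_next = (REWARDS[s_next - 1] if s_next in (1, 6)
                         else max(Q[(s_next, a2)] for a2 in ACTIONS))
            Q[(s, a)] = REWARDS[s - 1] + GAMMA * best_next

print({s: (Q[(s, "left")], Q[(s, "right")]) for s in range(2, 6)})
# {2: (50.0, 12.5), 3: (25.0, 6.25), 4: (12.5, 10.0), 5: (6.25, 20.0)}
```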
Picking actions
[Figure: six-state Mars rover, rewards 100, 0, 0, 0, 0, 40. Q values (left, right) for each state: (100, 100), (50, 12.5), (25, 6.25), (12.5, 10), (6.25, 20), (40, 40).]
Q(s, a) = Return if you
• start in state s.
• take action a (once).
• then behave optimally after that.
The best possible return from state s is max_{a} Q(s, a).
The best possible action in state s is the action a that gives max_{a} Q(s, a).
Q is also written Q* and called the optimal Q function.
State-action value function
State-action value function example
Jupyter Notebook
State-action value function
Bellman Equation
Bellman Equation
Q(s, a) = Return if you
• start in state s.
• take action a (once).
• then behave optimally after that.
s : current state                R(s) = reward of the current state
a : current action
s′ : state you get to after taking action a
a′ : action that you take in state s′
Bellman Equation
Q(s, a) = R(s) + γ max_{a′} Q(s′, a′)
[Figure: six-state Mars rover, rewards 100, 0, 0, 0, 0, 40; Q values (left, right): (100, 100), (50, 12.5), (25, 6.25), (12.5, 10), (6.25, 20), (40, 40).]
Explanation of Bellman Equation
Q(s, a) = Return if you
• start in state s.
• take action a (once).
• then behave optimally after that.
The best possible return from state s′ is max_{a′} Q(s′, a′).

Q(s, a) = R(s) + γ max_{a′} Q(s′, a′)
        = (reward you get right away) + γ × (return from behaving optimally starting from state s′)
Explanation of Bellman Equation
Q(s, a) = R(s) + γ max_{a′} Q(s′, a′)
[Figure: six-state Mars rover, rewards 100, 0, 0, 0, 0, 40; Q values (left, right): (100, 100), (50, 12.5), (25, 6.25), (12.5, 10), (6.25, 20), (40, 40).]
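Two worked cases with γ = 0.5 and the values above:
Q(2, right) = R(2) + 0.5 · max_{a′} Q(3, a′) = 0 + 0.5 · 25 = 12.5
Q(4, left) = R(4) + 0.5 · max_{a′} Q(3, a′) = 0 + 0.5 · 25 = 12.5
both matching the figure.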
State-action value function
Random (stochastic) environment (Optional)
Stochastic Environment
[Figure: six-state Mars rover, states 1–6.]
Expected Return
rewards:  100    0    0    0    0    40     (states 1–6)
Expected Return = Average( R₁ + γR₂ + γ²R₃ + γ³R₄ + ⋯ )
               = E[ R₁ + γR₂ + γ²R₃ + γ³R₄ + ⋯ ]
Expected Return
Goal of Reinforcement Learning:
Choose a policy π(s) = a that will tell us what action a to take in state s so as
to maximize the expected return.

Bellman Equation:  Q(s, a) = R(s) + γ E[ max_{a′} Q(s′, a′) ]
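When the environment is random, the expected return can be estimated by averaging many simulated rollouts. A small sketch (ours; the misstep probability 0.1 is an assumed value, not from the slide):

```python
import random

REWARDS = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
TERMINAL = {1, 6}

def simulate(start, policy, gamma=0.5, misstep_prob=0.1):
    s, total, discount = start, 0.0, 1.0
    total += discount * REWARDS[s]
    while s not in TERMINAL:
        a = policy(s)
        if random.random() < misstep_prob:            # environment is random:
            a = "left" if a == "right" else "right"   # occasionally go the wrong way
        s += -1 if a == "left" else 1
        discount *= gamma
        total += discount * REWARDS[s]
    return total

go_left = lambda s: "left"
print(sum(simulate(4, go_left) for _ in range(10000)) / 10000)   # estimate of E[R1 + gamma*R2 + ...]
```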
Jupyter Notebook
Continuous State Spaces
Example of continuous state applications
Discrete vs Continuous State
Discrete state: one of six positions: 1, 2, 3, 4, 5, 6.
Continuous state: the position can be any number, e.g. anywhere from 0 to 6 km.
Autonomous Helicopter
Continuous State Spaces
Lunar Lander
Lunar Lander
Lunar Lander
actions:
do nothing
left thruster
main thruster
right thruster
Reward Function
• Getting to landing pad: 100 – 140
• Additional reward for moving toward the pad (negative reward for moving away from it).
• Crash: -100
• Soft landing: +100
• Leg grounded: +10
• Fire main engine: -0.3
• Fire side thruster: -0.03
Lunar Lander Problem
Learn a policy π that, given the state
s = [ x, y, ẋ, ẏ, θ, θ̇, l, r ]
picks the action a = π(s) so as to maximize the return.
γ = 0.985
Continuous State Spaces
Learning the state-value function
Deep Reinforcement Learning
Input x = (s, a): the 8 state numbers [x, y, ẋ, ẏ, θ, θ̇, l, r] plus the action a as a one-hot vector, e.g. [1, 0, 0, 0].
Neural network: 12 inputs → 64 units → 64 units → 1 unit, output Q(s, a).
In a state s, use the neural network to compute
Q(s, nothing), Q(s, left), Q(s, main), Q(s, right)
Pick the action a that maximizes Q(s, a)
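A minimal Keras sketch of this network (assuming TensorFlow/Keras, as in the course labs; the layer sizes come from the slide, the optimizer and loss here are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

# 12 inputs = 8 state numbers (x, y, x_dot, y_dot, theta, theta_dot, l, r)
#           + the action a as a one-hot vector of length 4. Output: Q(s, a).
q_network = Sequential([
    Input(shape=(12,)),
    Dense(64, activation="relu"),
    Dense(64, activation="relu"),
    Dense(1, activation="linear"),
])
q_network.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```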
Bellman Equation
y = R(s) + γ max_{a′} Q(s′, a′)
x = (s, a)
Each experience tuple ( s⁽¹⁾, a⁽¹⁾, R(s⁽¹⁾), s′⁽¹⁾ ) gives one training example:
x⁽¹⁾ = ( s⁽¹⁾, a⁽¹⁾ ),   y⁽¹⁾ = R(s⁽¹⁾) + γ max_{a′} Q(s′⁽¹⁾, a′)
Learning Algorithm
Initialize the neural network randomly as a guess of Q(s, a).
Repeat {
    Take actions in the lunar lander. Get (s, a, R(s), s′).
    Store the 10,000 most recent (s, a, R(s), s′) tuples.
    Train the neural network:
        Create a training set of 10,000 examples using
        x = (s, a) and y = R(s) + γ max_{a′} Q(s′, a′).
        Train Q_new such that Q_new(s, a) ≈ y.
    Set Q = Q_new.
}
[Mnih et al., 2015, Human-level control through deep reinforcement learning]
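A sketch of the "create training set" step above (helper names are ours; terminal transitions would use y = R(s) alone, which is omitted here for brevity):

```python
import numpy as np

# Each stored experience is a tuple (s, a, R(s), s_next), as in the algorithm above.
def build_training_set(experiences, q_network, gamma=0.985, num_actions=4):
    one_hot = np.eye(num_actions)
    X, Y = [], []
    for s, a, reward, s_next in experiences:
        # Q(s_next, a') for every action a' (one network call per action with this architecture)
        q_next = [q_network.predict(np.concatenate([s_next, one_hot[a2]])[None, :], verbose=0)[0, 0]
                  for a2 in range(num_actions)]
        X.append(np.concatenate([s, one_hot[a]]))      # x = (s, a), action one-hot encoded
        Y.append(reward + gamma * max(q_next))         # y = R(s) + gamma * max_a' Q(s', a')
    return np.array(X), np.array(Y)                    # then train Q_new so that Q_new(x) ≈ y
```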
Continuous State Spaces
Algorithm refinement: improved neural network architecture
Deep Reinforcement Learning (architecture so far)
Input x = (s, a): the 8 state numbers plus the action a as a one-hot vector — 12 inputs → 64 units → 64 units → 1 unit, output Q(s, a).
In a state s, use the neural network to compute Q(s, nothing), Q(s, left), Q(s, main), Q(s, right), then pick the action a that maximizes Q(s, a).
Deep Reinforcement Learning
Input x = s: the 8 state numbers [x, y, ẋ, ẏ, θ, θ̇, l, r].
Neural network: 8 inputs → 64 units → 64 units → 4 units, outputting Q(s, nothing), Q(s, left), Q(s, main), Q(s, right) in a single forward pass.
In a state s, input s to the neural network.
Pick the action a that maximizes Q(s, a).
The target y = R(s) + γ max_{a′} Q(s′, a′) can now also be computed with a single forward pass over s′.
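A matching Keras sketch of the improved architecture (again, only the layer sizes come from the slide):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

# One forward pass now produces Q(s, a) for all four actions at once.
q_network = Sequential([
    Input(shape=(8,)),              # s = [x, y, x_dot, y_dot, theta, theta_dot, l, r]
    Dense(64, activation="relu"),
    Dense(64, activation="relu"),
    Dense(4, activation="linear"),  # [Q(s, nothing), Q(s, left), Q(s, main), Q(s, right)]
])
```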
Continuous State Spaces
Algorithm refinement: ε-greedy policy
Learning Algorithm
Initialize the neural network randomly as a guess of Q(s, a).
Repeat {
    Take actions in the lunar lander. Get (s, a, R(s), s′).
    Store the 10,000 most recent (s, a, R(s), s′) tuples.
    Train model:
        Create a training set of 10,000 examples using
        x = (s, a) and y = R(s) + γ max_{a′} Q(s′, a′).
        Train Q_new such that Q_new(s, a) ≈ y.
    Set Q = Q_new.
}
How to choose actions while still learning?
In some state s:
Option 1:
    Pick the action a that maximizes Q(s, a).
Option 2 (ε-greedy, with ε = 0.05):
    With probability 0.95, pick the action a that maximizes Q(s, a).
    With probability 0.05, pick an action a randomly.
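A sketch of Option 2 in code (assumes the 4-output network above; names are ours):

```python
import numpy as np

def pick_action(q_network, s, epsilon=0.05):
    # epsilon-greedy: mostly exploit the current Q estimate, occasionally explore
    if np.random.rand() < epsilon:
        return int(np.random.randint(4))                 # random action with probability 0.05
    q_values = q_network.predict(s[None, :], verbose=0)[0]
    return int(np.argmax(q_values))                      # greedy action with probability 0.95
```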
Continuous State Spaces
Algorithm refinement: mini-batch and soft update (optional)
How to choose actions while still learning?
x (size in feet²)    y (price in $1000’s)
2104                 400
1416                 232
1534                 315
852                  178
…                    …
3210                 870

J(w, b) = (1 / 2m) Σᵢ₌₁ᵐ ( f_{w,b}(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²

repeat {
    w = w − α (∂/∂w) J(w, b)
    b = b − α (∂/∂b) J(w, b)
}
Mini-batch

x (size in feet²)    y (price in $1000’s)
2104                 400
1416                 232
1534                 315
852                  178
…                    …
3210                 870

Instead of using all m examples on every gradient step, each step uses only a small "batch" of them.
[Plot: price in $1000’s vs. size in feet², showing the training examples.]
Mini-batch
x (size in feet²)    y (price in $1000’s)
2104                 400
1416                 232
1534                 315
852                  178
…                    …
3210                 870

[Figures: the cost J(w, b) and the training data (size in feet², # bedrooms), comparing gradient descent on the full batch with gradient descent on mini-batches.]
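A numpy sketch of one mini-batch gradient descent step for the linear-regression example above (variable names, the learning rate, and the batch size are illustrative):

```python
import numpy as np

def minibatch_step(x, y, w, b, alpha=1e-7, batch_size=4):
    idx = np.random.choice(len(x), size=batch_size, replace=False)   # pick a small batch
    err = (w * x[idx] + b) - y[idx]                                   # f_wb(x) - y on the batch
    w -= alpha * np.mean(err * x[idx])                                # w = w - alpha * dJ/dw
    b -= alpha * np.mean(err)                                         # b = b - alpha * dJ/db
    return w, b

x = np.array([2104, 1416, 1534, 852, 3210], dtype=float)   # size in feet^2
y = np.array([400, 232, 315, 178, 870], dtype=float)        # price in $1000's
w, b = 0.0, 0.0
for _ in range(1000):
    w, b = minibatch_step(x, y, w, b)
```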
Learning Algorithm
Initialize the neural network randomly as a guess of Q(s, a).
Repeat {
    Take actions in the lunar lander. Get (s, a, R(s), s′).
    Store the 10,000 most recent (s, a, R(s), s′) tuples.
    Train model:
        Create a training set of 10,000 examples using
        x = (s, a) and y = R(s) + γ max_{a′} Q(s′, a′).
        Train Q_new such that Q_new(s, a) ≈ y.
    Set Q = Q_new.
}
Soft Update
Set Q = Q_new.
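The refinement is to blend the new parameters gradually into the old ones rather than overwriting them (a "soft" update). A minimal Keras sketch, where the blend factor 0.01 is illustrative:

```python
def soft_update(q_network, q_new_network, tau=0.01):
    # Q = tau * Q_new + (1 - tau) * Q, applied weight-by-weight
    blended = [tau * w_new + (1.0 - tau) * w
               for w, w_new in zip(q_network.get_weights(), q_new_network.get_weights())]
    q_network.set_weights(blended)
```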
Continuous State Spaces
The state of reinforcement learning
Limitations of Reinforcement Learning
• Much easier to get to work in a simulation than a real robot!
• Far fewer applications than supervised and unsupervised learning.
• But … exciting research direction with potential for future applications.
Conclusion
Summary and Thank you
Courses
• Supervised Machine Learning: Regression and Classification
Linear regression, logistic regression, gradient descent
• Advanced Learning Algorithms
Neural networks, decision trees, advice for ML
• Unsupervised Learning, Recommenders, Reinforcement Learning
Clustering, anomaly detection, collaborative filtering, content-based filtering, reinforcement learning