
COMP0124 Multi-agent Artificial Intelligence

Single-agent
Reinforcement Learning
(Value-based approaches)
Prof. Jun Wang
Computer Science, UCL
Recap
• Lecture 1: Multiagent AI and basic game theory
• Lecture 2: Potential games, and extensive form and
repeated games
• Lecture 3: Solving (“Learning”) Nash Equilibria
• Lecture 4: Bayesian Games, auction theory and
mechanism design
• Lecture 5: Learning and deep neural networks
• Lecture 6: Single-agent Learning (1)
• Lecture 7: Single-agent Learning (2)
• Lecture 8: Multi-agent Learning (1)
• Lecture 9: Multi-agent Learning (2)
• Lecture 10: Multi-agent Learning (3)
Subject to change
Learning in a Single Agent
• A typical AI setting concerns the learning performed by an individual agent
• In that setting, the goal is to design an agent that learns to function successfully in an environment that is unknown and potentially also changes as the agent is learning
  – Learning to recommend in collaborative filtering
  – Learning to predict click-through rate by logistic regression
  – Learning to take actions: reinforcement learning
Learning to take actions: AI plays Atari Games
• Rules of the game are unknown
• Learn from interactive game-play
• Pick actions on the joystick, see pixels and scores

Mnih, V., Kavukcuoglu, K., Silver, D., et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602, 2013.
AlphaGo vs. the world's 'Go' champion (2016)

http://www.goratings.org/
Coulom, Rémi. "Whole-history rating: A Bayesian rating system for players of time-varying strength." Computers and Games. Springer Berlin Heidelberg, 2008. 113-124.
Content
• Reinforcement Learning
– The model-based methods
• Markov Decision Process
• Planning by Dynamic Programming
• Convergence Analysis
– The model-free methods
• Model-free Prediction
– Monte-Carlo and Temporal Difference
• Model-free Control
– On-policy SARSA and off-policy Q-learning
– Convergence Analysis
• Deep Reinforcement Learning Examples
– Atari games
– Alpha Go
Reinforcement Learning
• Reinforcement learning (RL)
  – learning to take actions over an environment so as to maximize some numerical value which represents a long-term objective.
• The single-agent reinforcement learning framework
  – The agent interacts with the environment by selecting actions to take and then perceiving the effects of those actions: a new state and a reward signal indicating if it has reached some goal (or has been penalized, if the reward is negative).
  – The agent receives the environment's state and a reward associated with the last state transition.
  – It then calculates an action which is sent back to the system.
  – In response, the system makes a transition to a new state and the cycle is repeated.
  – The problem is to learn a way of controlling the environment so as to maximize the total reward.

[Figure: agent-environment loop — state and reward flow from the environment to the agent; the action flows back]

Neto, Gonçalo. "From single-agent to multi-agent reinforcement learning: Foundational concepts and methods." Learning theory course (2005).
Reinforcement Learning
• Learning from interaction
  – Given the current situation, what to do next in order to maximize utility?

[Figure: agent-environment loop with observation, action and reward]
Reinforcement Learning Definition
• A computational approach to learning from interaction to achieve a goal
• Three aspects
  – Sensation: sense the state of the environment to some extent
  – Action: able to take actions that affect the state and achieve the goal
  – Goal: maximize the cumulated reward
Reinforcement Learning
• At each step t, the agent
  – Executes action A_t
  – Receives observation O_t
  – Receives scalar reward R_t
• The environment
  – Receives action A_t
  – Emits observation O_{t+1}
  – Emits scalar reward R_{t+1}
• t increments at each environment step
Elements of RL Systems
• History is the sequence of observations, actions, rewards
    H_t = O_1, R_1, A_1, O_2, R_2, A_2, ..., O_{t-1}, R_{t-1}, A_{t-1}, O_t, R_t
  – i.e. all observable variables up to time t
  – E.g., the sensorimotor stream of a robot or embodied agent
• What happens next depends on the history:
  – The agent selects actions
  – The environment selects observations/rewards
• State is the information used to determine what happens next (actions, observations, rewards)
  – Formally, state is a function of the history
      S_t = f(H_t)
Elements of RL Systems
• Policy is the learning agent's way of behaving at a given time
  – It is a map from state to action
  – Deterministic policy
      a = π(s)
  – Stochastic policy
      π(a|s) = P(A_t = a | S_t = s)
Elements of RL Systems
• Reward
  – A scalar defining the goal in an RL problem
  – For an immediate sense of what is good
• Value function
  – State value is a scalar specifying what is good in the long run
  – Value function is a prediction of the cumulated future reward
  – Used to evaluate the goodness/badness of states (given the current policy)
      v_π(s) = E_π[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s]
Elements of RL Systems
• A model of the environment mimics the behavior of the environment
  – Predicts the next state
      P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
  – Predicts the next (immediate) reward
      R^a_s = E[R_{t+1} | S_t = s, A_t = a]

The agent may know the model of the environment or may not!
Maze Example
• State: agent's location
• Action: N, E, S, W

Maze Example
• State: agent's location
• Action: N, E, S, W
• State transition: move to the next grid according to the action
  – No move if the action is into the wall

Maze Example
• State: agent's location
• Action: N, E, S, W
• State transition: move to the next grid according to the action
• Reward: -1 per time step

Maze Example
• State: agent's location
• Action: N, E, S, W
• State transition: move to the next grid according to the action
• Reward: -1 per time step
• Given a policy as shown above
• Arrows represent policy π(s) for each state s
Maze Example
• State: agent's location
• Action: N, E, S, W
• State transition: move to the next grid according to the action
• Reward: -1 per time step
• Numbers represent value v_π(s) of each state s

[Figure: the maze grid annotated with state values, ranging from -24 in the far corner down to -1 next to the goal]


Content
• Reinforcement Learning
– The model-based methods
• Markov Decision Process
• Planning by Dynamic Programming
• Convergence Analysis
– The model-free methods
• Model-free Prediction
– Monte-Carlo and Temporal Difference
• Model-free Control
– On-policy SARSA and off-policy Q-learning
– Convergence Analysis
• Deep Reinforcement Learning Examples
– Atari games
– Alpha Go
Markov Decision Process
• Markov decision processes (MDPs) provide a
mathematical framework for modeling decision
making in situations where outcomes are partly
random and partly under the control of a decision
maker.

• MDPs formally describe an environment for RL


– where the environment is FULLY observable
– i.e. the current state completely characterizes the
process (Markov property)
Markov Property
"The future is independent of the past given the present"
• Definition
  – A state S_t is Markov if and only if
      P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
• Properties
  – The state captures all relevant information from the history
  – Once the state is known, the history may be thrown away
  – i.e. the state is a sufficient statistic of the future
Markov Decision Process
• A Markov decision process is a tuple (S, A, {P_sa}, γ, R)
• S is the set of states
  – E.g., location in a maze, or the current screen in an Atari game
• A is the set of actions
  – E.g., move N, E, S, W, or the direction of the joystick and the buttons
• P_sa are the state transition probabilities
  – For each state s ∈ S and action a ∈ A, P_sa is a distribution over the next state in S
• γ ∈ [0,1] is the discount factor for the future reward
• R : S × A → ℝ is the reward function
  – Sometimes the reward is only assigned to the state
Markov Decision Process
The dynamics of an MDP proceeds as
• Start in a state s_0
• The agent chooses some action a_0 ∈ A
• The agent gets the reward R(s_0, a_0)
• The MDP randomly transits to some successor state s_1 ~ P_{s_0 a_0}
• This proceeds iteratively
    s_0 --a_0, R(s_0, a_0)--> s_1 --a_1, R(s_1, a_1)--> s_2 --a_2, R(s_2, a_2)--> s_3 ...
• Until a terminal state s_T, or the process proceeds with no end
• The total payoff of the agent is
    R(s_0, a_0) + γR(s_1, a_1) + γ²R(s_2, a_2) + ...
MDP Goal and Policy
• The goal is to choose actions over time to maximize the expected cumulated reward
    E[R(s_0) + γR(s_1) + γ²R(s_2) + ...]
• γ ∈ [0,1] is the discount factor for the future reward, which makes the agent prefer immediate reward to future reward
  – In finance, today's $1 is more valuable than $1 tomorrow
• Given a particular policy π(s) : S → A
  – i.e. take the action a = π(s) at state s
• Define the value function for π
    V^π(s) = E[R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, π]
  – i.e. the expected cumulated reward given the start state and taking actions according to π
Bellman Equation for Value Function
• Define the value function for π
    V^π(s) = E[R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, π]
           = E[R(s_0) + γ V^π(s_1) | s_0 = s, π]
           = R(s) + γ Σ_{s'∈S} P_{sπ(s)}(s') V^π(s')
  – the Bellman equation: a necessary condition for optimality
  – immediate reward, plus the time decay γ times the state transition times the value of the next state

Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vol. 1, No. 2). Belmont, MA: Athena Scientific.
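Because the Bellman equation is linear in V^π, a small fully known MDP can be evaluated exactly by solving a linear system. The following is a minimal sketch (not from the slides); the 2-state transition matrix, reward vector and γ are hypothetical placeholders, and numpy is assumed.

import numpy as np

# Minimal sketch: exact policy evaluation for a toy, fully known MDP.
# P_pi[s, s'] is the transition probability under a fixed policy pi;
# R[s] is the immediate reward; both are illustrative placeholders.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
R = np.array([1.0, 0.0])

# Bellman equation in matrix form: V = R + gamma * P_pi @ V
# Rearranged: (I - gamma * P_pi) V = R, a linear system in V.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
print(V)  # V^pi for this toy MDP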
Optimal Value Function
• The optimal value function for each state s is the best possible sum of discounted rewards that can be attained by any policy
    V*(s) = max_π V^π(s)
• The Bellman equation for the optimal value function
    V*(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_sa(s') V*(s')
• The optimal policy
    π*(s) = arg max_{a∈A} Σ_{s'∈S} P_sa(s') V*(s')
• For every state s and every policy π
    V*(s) = V^{π*}(s) ≥ V^π(s)
Value Iteration & Policy Iteration
• Note that the value function and policy are correlated
    V^π(s) = R(s) + γ Σ_{s'∈S} P_{sπ(s)}(s') V^π(s')
    π(s) = arg max_{a∈A} Σ_{s'∈S} P_sa(s') V^π(s')
• It is feasible to perform iterative updates towards the optimal value function and optimal policy
  – Value iteration
  – Policy iteration
Value Iteration
• For an MDP with finite state and action spaces
    |S| < ∞, |A| < ∞
• Value iteration is performed as
    1. For each state s, initialize V(s) = 0.
    2. Repeat until convergence {
         For each state, update
             V(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_sa(s') V(s')
       }
• Note that there is no explicit policy in the above calculation
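As a concrete illustration of the update above, here is a minimal value-iteration sketch for a small tabular MDP (my own sketch, not the course code); the array shapes are an assumption, with P[a, s, s'] the transition probabilities and R[s] the rewards, and numpy is assumed.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P: transition probabilities of shape (|A|, |S|, |S|); R: rewards of shape (|S|,)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                      # 1. initialize V(s) = 0
    while True:                                 # 2. repeat until convergence
        # Bellman optimality backup: V(s) = R(s) + max_a gamma * sum_s' P_sa(s') V(s')
        V_new = R + gamma * np.max(P @ V, axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

A greedy policy can then be read off afterwards with np.argmax(P @ V, axis=0), matching the arg-max improvement step on the previous slide.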
Value Iteration Example: Shortest Path
Policy Iteration
• For an MDP with finite state and action spaces
    |S| < ∞, |A| < ∞
• Policy iteration is performed as
    1. Initialize π randomly
    2. Repeat until convergence {
         a) Let V := V^π
         b) For each state, update
             π(s) = arg max_{a∈A} Σ_{s'∈S} P_sa(s') V(s')
       }
• The step of the value function update (a) could be time-consuming
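A matching policy-iteration sketch under the same assumed conventions (P[a, s, s'], R[s], numpy); here step (a) evaluates the policy exactly with a linear solve, which is exactly the step noted above as potentially time-consuming.

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P: shape (|A|, |S|, |S|); R: shape (|S|,). Returns (policy, V)."""
    n_actions, n_states, _ = P.shape
    policy = np.random.randint(n_actions, size=n_states)   # 1. initialize pi randomly
    while True:                                             # 2. repeat until convergence
        # a) policy evaluation: solve V = R + gamma * P_pi V exactly
        P_pi = P[policy, np.arange(n_states)]               # rows P_{s,pi(s)}(.), shape (|S|, |S|)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # b) policy improvement: greedy with respect to V
        new_policy = np.argmax(P @ V, axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy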
Policy Iteration
• Policy evaluation
  – Estimate V^π
  – Iterative policy evaluation
• Policy improvement
  – Generate π' ≥ π
  – Greedy policy improvement
Evaluating a Random Policy in the Small Gridworld
[Figure: 4x4 gridworld with r = -1 on all transitions]
• Undiscounted episodic MDP (γ = 1)
• Nonterminal states 1, ..., 14
• Two terminal states (shaded squares)
• Actions leading out of the grid leave the state unchanged
• Reward is -1 until the terminal state is reached
• Agent follows a uniform random policy
    π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25


Evaluating a Random Policy in the Small Gridworld
[Figure: two columns per row — V_k for the random policy, and the greedy policy w.r.t. V_k — shown for k = 0 (random policy), 1, 2, 3, 10 and ∞; V := V^π as k → ∞, and the greedy policy shown at k = 10 and k = ∞ is the optimal policy]
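A minimal sketch of iterative policy evaluation for this gridworld, following the slide's description (4x4 grid, two terminal corner states, reward -1 on every transition, γ = 1, uniform random policy); numpy is assumed, and the state numbering is my own convention.

import numpy as np

def gridworld_policy_evaluation(n_iters=1000):
    # States 0..15 laid out row by row; 0 and 15 are terminal with value 0.
    V = np.zeros(16)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # N, S, W, E
    for _ in range(n_iters):
        V_new = np.zeros(16)
        for s in range(1, 15):                          # skip the two terminal states
            r, c = divmod(s, 4)
            for dr, dc in moves:                        # uniform random policy: prob 1/4 each
                nr, nc = r + dr, c + dc
                if not (0 <= nr < 4 and 0 <= nc < 4):   # moves off the grid leave the state unchanged
                    nr, nc = r, c
                V_new[s] += 0.25 * (-1.0 + V[4 * nr + nc])
        V = V_new
    return V.reshape(4, 4)

print(np.round(gridworld_policy_evaluation(), 1))       # approaches V^pi for the random policy (k -> infinity)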
Value Iteration vs. Policy Iteration

Value iteration:
  1. For each state s, initialize V(s) = 0.
  2. Repeat until convergence {
       For each state, update
           V(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_sa(s') V(s')
     }

Policy iteration:
  1. Initialize π randomly
  2. Repeat until convergence {
       a) Let V := V^π
       b) For each state, update
           π(s) = arg max_{a∈A} Σ_{s'∈S} P_sa(s') V(s')
     }

Remarks:
  1. Value iteration is a greedy update strategy
  2. In policy iteration, the value function update by the Bellman equation is costly
  3. For small-space MDPs, policy iteration is often very fast and converges quickly
  4. For large-space MDPs, value iteration is more practical (efficient)
  5. If there is no state-transition loop, it is better to use value iteration

My point of view: value iteration is like SGD and policy iteration is like BGD
Convergence Analysis
• How do we know that value iteration converges to v*?
• Or that iterative policy evaluation converges to v_π? And therefore that policy iteration converges to v*?
• Is the solution unique?
• How fast do these algorithms converge?
• These questions are resolved by the contraction mapping theorem.
Value Function Space
• Consider the vector space V over value functions
• There are |S| dimensions
  – Each point in this space fully specifies a value function v(s)
  – What does the Bellman backup do to the points in this space?
    • We will show that it brings value functions closer
    • And therefore the backups must converge on a unique solution

[Figure: a value function as a point V = (V_{s1}, V_{s2}, V_{s3}) in a space with one axis per state]
Value Function ∞-Norm
• We will measure the distance between state-value functions u and v by the ∞-norm (infinity norm, or max norm), i.e. the largest difference between state values:
    ||u − v||_∞ = max_{s∈S} |u(s) − v(s)|

[Figure: U and V as two points in the value function space]
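For instance, a tiny sketch of the ∞-norm distance between two value functions (the numbers are arbitrary; numpy assumed):

import numpy as np

u = np.array([1.0, 2.0, 3.0])   # value function u(s) over three states
v = np.array([1.5, 1.0, 3.2])   # value function v(s)
print(np.max(np.abs(u - v)))    # infinity-norm distance: 1.0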
Banach's Contraction Mapping Theorem
• For any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction,
  – T converges to a unique fixed point: v = T(v)
  – At a linear convergence rate of γ

γ-contraction means: d(T(v), T(u)) ≤ γ d(v, u), where d is the distance in the metric space and γ is the parameter (this makes sure that the operator brings value functions closer)
Bellman Operator on Q-function
• Let us define the Bellman expectation operator on the Q-function
    Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t R_t | S_0 = s, A_0 = a ]
              = E[ R(S_0, A_0) + γ Σ_{t=0}^∞ γ^t R_{t+1} | S_0 = s, A_0 = a ]
              = E[ R(S_0, A_0) + γ Q^π(S_1, π(S_1)) | S_0 = s, A_0 = a ]
              = r(s, a) + γ Σ_{s'∈S} P(s'|s, a) Q^π(s', π(s'))   ≜ (T^π Q^π)(s, a)
• Bellman optimality operator on the Q-function
    (T* Q)(s, a) ≜ r(s, a) + γ Σ_{s'∈S} P(s'|s, a) max_{a'∈A} Q(s', a')
Bellman Operator on Q-function
• Value-based approaches try to find a fixed point Q̂ such that Q̂ ≈ T* Q̂
• The greedy policy from Q̂ is thus close to the optimal policy π*
    Q^{π(s; Q̂)} ≈ Q^{π*} = Q*
    where π(s; Q̂) = argmax_{a∈A} Q̂(s, a)

Contraction Property of Bellman Operator on Q-function
    ||T* Q_1 − T* Q_2||_∞ ≤ γ ||Q_1 − Q_2||_∞
    Q_{k+1} ← T* Q_k
• Let's say Q_2 is Q*

Proof of the Contraction Property
• We can see that
    |(T* Q_1)(s, a) − (T* Q_2)(s, a)|
      = γ | Σ_{s'} P(s'|s, a) [ max_{a'∈A} Q_1(s', a') − max_{a'∈A} Q_2(s', a') ] |
      ≤ γ Σ_{s'} P(s'|s, a) | max_{a'∈A} Q_1(s', a') − max_{a'∈A} Q_2(s', a') |
      ≤ γ Σ_{s'} P(s'|s, a) max_{a'∈A} |Q_1(s', a') − Q_2(s', a')|
      ≤ γ max_{(s',a')∈S×A} |Q_1(s', a') − Q_2(s', a')| · Σ_{s'} P(s'|s, a)        (the last sum equals 1)
• Therefore we get
    sup_{(s,a)∈S×A} |(T* Q_1)(s, a) − (T* Q_2)(s, a)| ≤ γ sup_{(s,a)∈S×A} |Q_1(s, a) − Q_2(s, a)|
  or, succinctly,
    ||T* Q_1 − T* Q_2||_∞ ≤ γ ||Q_1 − Q_2||_∞

Proof of the Contraction Property (2)
• Banach's fixed-point theorem can be used to show the existence and uniqueness of the optimal value function.
• Also, since we have Q* = T* Q*,
    ||T* Q_0 − Q*||_∞ = ||T* Q_0 − T* Q*||_∞ ≤ γ ||Q_0 − Q*||_∞
• Therefore, for value iteration, where Q_{k+1} ← T* Q_k, we have
    ||Q_k − Q*||_∞ = ||(T*)^(k) Q_0 − T* Q*||_∞ ≤ γ^k ||Q_0 − Q*||_∞
• Finally, as k → ∞, the RHS converges to zero, showing that ||Q_k − Q*||_∞ → 0. This shows that Q_k → Q*.

Proof of the Contraction Property (3)

Bellman Expectation Backup is a Contraction
• Define the Bellman expectation backup operator T^π,
    T^π(v) = R^π + γ P^π v
• This operator is a γ-contraction, i.e. it makes value functions closer by at least γ,
    ||T^π(u) − T^π(v)||_∞ = ||(R^π + γ P^π u) − (R^π + γ P^π v)||_∞
                          = ||γ P^π (u − v)||_∞
                          ≤ γ ||P^π||_∞ ||u − v||_∞
                          ≤ γ ||u − v||_∞

Bellman Optimality Backup is a Contraction
• Define the Bellman optimality backup operator T*,
    T*(v) = max_{a∈A} (R^a + γ P^a v)
• This operator is a γ-contraction, i.e. it makes value functions closer by at least γ (similar to the previous proof)
    ||T*(u) − T*(v)||_∞ ≤ γ ||u − v||_∞

Convergence of Value Iteration
• Thus, by the contraction mapping theorem
  – The Bellman optimality operator T* has a unique fixed point
  – v* is a fixed point of T* (by the Bellman optimality equation)
  – Value iteration converges on v*
• However
  – Exact representation of the value function for a large state space is infeasible, i.e. Q(s, a) for all (s, a) ∈ S × A.
  – The exact summation in the Bellman operator requires knowing the environmental model, i.e. T*(v) = max_{a∈A} R^a + γ P^a v.
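To see the γ^k contraction bound numerically, here is a small sketch (my own toy example, not from the slides; numpy assumed) that applies the Bellman optimality operator on Q repeatedly for a random finite MDP and tracks ||Q_k − Q*||_∞:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random toy MDP: P[a, s, s'] is a valid transition distribution, r[s, a] a reward table.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def T_star(Q):
    # (T*Q)(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a')
    return r + gamma * (P @ Q.max(axis=1)).T

# Approximate Q* by iterating T* many times, then watch the error of Q-value iteration shrink.
Q_star = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_star = T_star(Q_star)

Q = np.zeros((n_states, n_actions))
for k in range(5):
    print(f"k={k}  ||Q_k - Q*||_inf = {np.max(np.abs(Q - Q_star)):.4f}")
    Q = T_star(Q)   # each step reduces the error by a factor of at most gamma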
Content
• Reinforcement Learning
– The model-based methods
• Markov Decision Process
• Planning by Dynamic Programming
• Convergence Analysis
– The model-free methods
• Model-free Prediction
– Monte-Carlo and Temporal Difference
• Model-free Control
– On-policy SARSA and off-policy Q-learning
– Convergence Analysis
• Deep Reinforcement Learning Examples
– Atari games
– Alpha Go
Learning an MDP Model
• So far we have been focused on
  – Calculating the optimal value function
  – Learning the optimal policy
  given a known MDP model
  – i.e. the state transition P_sa(s') and reward function R(s) are explicitly given
• In realistic problems, often the state transition and reward function are not explicitly given
  – For example, we have only observed some episodes
      Episode 1:  s_0^(1) --a_0^(1), R(s_0)^(1)--> s_1^(1) --a_1^(1), R(s_1)^(1)--> s_2^(1) --a_2^(1), R(s_2)^(1)--> s_3^(1) ... s_T^(1)
      Episode 2:  s_0^(2) --a_0^(2), R(s_0)^(2)--> s_1^(2) --a_1^(2), R(s_1)^(2)--> s_2^(2) --a_2^(2), R(s_2)^(2)--> s_3^(2) ... s_T^(2)
• Learn an MDP model from "experience"
  – Learning the state transition probabilities P_sa(s')
      P_sa(s') = (#times we took action a in state s and got to state s') / (#times we took action a in state s)
  – Learning the reward R(s), i.e. the expected immediate reward
      R(s) = average{ R(s)^(i) }
• Algorithm
  1. Initialize π randomly.
  2. Repeat until convergence {
     a) Execute π in the MDP for some number of trials
     b) Using the accumulated experience in the MDP, update our estimates for P_sa and R
     c) Apply value iteration with the estimated P_sa and R to get the new estimated value function V
     d) Update π to be the greedy policy w.r.t. V
  }
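A minimal sketch of the counting estimates above, given logged episodes; the episode format (lists of (s, a, r, s_next) tuples) is my own assumption rather than anything prescribed by the slides.

from collections import defaultdict

def estimate_mdp(episodes):
    """episodes: list of trajectories, each a list of (s, a, r, s_next) tuples."""
    sa_counts = defaultdict(int)            # number of times action a was taken in state s
    sas_counts = defaultdict(int)           # number of times that (s, a) led to s'
    reward_sums = defaultdict(float)
    reward_counts = defaultdict(int)

    for episode in episodes:
        for s, a, r, s_next in episode:
            sa_counts[(s, a)] += 1
            sas_counts[(s, a, s_next)] += 1
            reward_sums[s] += r
            reward_counts[s] += 1

    # P_sa(s') = #(s, a -> s') / #(s, a);  R(s) = average observed reward in s
    P = {(s, a, s_next): c / sa_counts[(s, a)] for (s, a, s_next), c in sas_counts.items()}
    R = {s: reward_sums[s] / reward_counts[s] for s in reward_counts}
    return P, R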
Model-free Reinforcement Learning
• So far we have been focused on
  – Calculating the optimal value function
  – Learning the optimal policy
  given a known MDP model
  – i.e. the state transition P_sa(s') and reward function R(s) are explicitly given
• In realistic problems, often the state transition and reward function are not explicitly given
  – For example, we have only observed some episodes
      Episode 1:  s_0^(1) --a_0^(1), R(s_0)^(1)--> s_1^(1) --a_1^(1), R(s_1)^(1)--> s_2^(1) --a_2^(1), R(s_2)^(1)--> s_3^(1) ... s_T^(1)
      Episode 2:  s_0^(2) --a_0^(2), R(s_0)^(2)--> s_1^(2) --a_1^(2), R(s_1)^(2)--> s_2^(2) --a_2^(2), R(s_2)^(2)--> s_3^(2) ... s_T^(2)
• Model-free RL is to directly learn value & policy from experience without building an MDP
• Key steps: (1) estimate value function; (2) optimize policy
Value Function Estimation
• In model-based RL (MDP), the value function is calculated by dynamic programming
    V^π(s) = E[R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, π]
           = R(s) + γ Σ_{s'∈S} P_{sπ(s)}(s') V^π(s')
• Now in model-free RL
  – We cannot directly know P_sa and R
  – But we have a list of experiences to estimate the values
      Episode 1:  s_0^(1) --a_0^(1), R(s_0)^(1)--> s_1^(1) --a_1^(1), R(s_1)^(1)--> s_2^(1) --a_2^(1), R(s_2)^(1)--> s_3^(1) ... s_T^(1)
      Episode 2:  s_0^(2) --a_0^(2), R(s_0)^(2)--> s_1^(2) --a_1^(2), R(s_1)^(2)--> s_2^(2) --a_2^(2), R(s_2)^(2)--> s_3^(2) ... s_T^(2)
Monte-Carlo Methods
• Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.
• For example, to calculate the circle's surface
    Circle Surface = Square Surface × (#points in circle / #points in total)
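A minimal sketch of that estimate (pure Python; the unit square and a circle of radius 0.5 are my own choice of setup):

import random

def circle_surface_mc(n_points=1_000_000):
    square_surface = 1.0                                  # unit square
    in_circle = 0
    for _ in range(n_points):
        x, y = random.random(), random.random()
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:       # inside the circle of radius 0.5
            in_circle += 1
    return square_surface * in_circle / n_points          # approaches pi * 0.5**2 ≈ 0.785

print(circle_surface_mc())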
Monte-Carlo Methods
• Go Game: to estimate the winning rate given the current state
    Win Rate(s) = (#win simulation cases started from s) / (#simulation cases started from s in total)
Monte-Carlo Value Estimation
• Goal: learn V^π from episodes of experience under policy π
    s_0^(i) --a_0^(i), R_1^(i)--> s_1^(i) --a_1^(i), R_2^(i)--> s_2^(i) --a_2^(i), R_3^(i)--> s_3^(i) ... s_T^(i)  ~ π
• Recall that the return is the total discounted reward
    G_t = R_{t+1} + γR_{t+2} + ... + γ^{T-1} R_T
• Recall that the value function is the expected return
    V^π(s) = E[R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, π]
           = E[G_t | s_t = s, π]
           ≈ (1/N) Σ_{i=1}^N G_t^(i)
  – Sample N episodes from state s using policy π
  – Calculate the average of the cumulated reward
• Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
Monte-Carlo Value Estimation
• Implementation
  – Sample episodes under policy π
      s_0^(i) --a_0^(i), R_1^(i)--> s_1^(i) --a_1^(i), R_2^(i)--> s_2^(i) --a_2^(i), R_3^(i)--> s_3^(i) ... s_T^(i)  ~ π
  – Every time-step t that state s is visited in an episode
      • Increment counter  N(s) ← N(s) + 1
      • Increment total return  S(s) ← S(s) + G_t
      • Value is estimated by mean return  V(s) = S(s)/N(s)
  – By the law of large numbers
      V(s) → V^π(s) as N(s) → ∞
Incremental Monte-Carlo Updates
• Update V(s) incrementally after each episode
• For each state S_t with cumulated return G_t
    N(S_t) ← N(S_t) + 1
    V(S_t) ← V(S_t) + (1/N(S_t)) (G_t − V(S_t))
• For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes
    V(S_t) ← V(S_t) + α(G_t − V(S_t))


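A minimal every-visit Monte-Carlo estimator combining the counting and incremental forms above (my own sketch; episodes are assumed to be lists of (state, reward) pairs, with the reward received on leaving each state):

from collections import defaultdict

def mc_value_estimation(episodes, gamma=1.0, alpha=None):
    """Every-visit Monte-Carlo prediction.
    episodes: list of trajectories, each a list of (state, reward) pairs.
    alpha=None uses the running mean 1/N(s); a constant alpha gives the
    non-stationary update V(s) <- V(s) + alpha * (G_t - V(s))."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G accumulates the return G_t at every visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            step = alpha if alpha is not None else 1.0 / N[state]
            V[state] += step * (G - V[state])
    return dict(V)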
Monte-Carlo Value Estimation
Idea:  V(S_t) ≈ (1/N) Σ_{i=1}^N G_t^(i)
Implementation:  V(S_t) ← V(S_t) + α(G_t − V(S_t))

• MC methods learn directly from episodes of experience
• MC is model-free: no knowledge of MDP transitions / rewards
• MC learns from complete episodes: no bootstrapping (discussed later)
• MC uses the simplest possible idea: value = mean return
• Caveat: can only apply MC to episodic MDPs
  – All episodes must terminate
Temporal-Difference Learning
• TD methods learn directly from episodes of experience
• TD is model-free: no knowledge of MDP transitions / rewards
• TD learns from incomplete episodes, by bootstrapping
• TD updates a guess towards a guess

  Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... ≈ Rt+1 + γV(St+1)

  V(St) ← V(St) + α(Rt+1 + γV(St+1) − V(St))
          where Rt+1 is an observation and V(St+1) is a guess of the future

Sutton, Richard S. "Learning to predict by the methods of temporal differences." Machine Learning 3.1 (1988): 9-44.

Monte Carlo vs. Temporal Difference
• The same goal: learn Vπ from episodes of experience under policy π

• Incremental every-visit Monte-Carlo
  – Update value V(St) toward the actual return Gt
    V(St) ← V(St) + α(Gt − V(St))

• Simplest temporal-difference learning algorithm: TD
  – Update value V(St) toward the estimated return Rt+1 + γV(St+1)
    V(St) ← V(St) + α(Rt+1 + γV(St+1) − V(St))
  – TD target: Rt+1 + γV(St+1)
  – TD error: δt = Rt+1 + γV(St+1) − V(St)
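As an illustration of the two update rules above, here is a minimal, hypothetical sketch (not from the lecture) contrasting the incremental MC update with the TD(0) update; alpha and gamma are the step-size and discount factor, and V is a dict of value estimates.

def mc_update(V, state, G, alpha=0.1):
    """Incremental every-visit Monte-Carlo: move V(s) toward the actual return G."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, done=False):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = reward if done else reward + gamma * V[next_state]
    td_error = target - V[state]          # delta_t in the slides
    V[state] += alpha * td_error
    return td_error

The MC update can only be applied once Gt is known at the end of the episode, whereas the TD(0) update can be applied online after every transition.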


Driving Home Example
Each day as you drive home from work, you try to predict how long it will take to get home.

State                Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
Leaving office       0                        30                     30
Reach car, raining   5                        35                     40
Exit highway         20                       15                     35
Behind truck         30                       10                     40
Home street          40                       3                      43
Arrive home          43                       0                      43

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.
Driving Home Example: MC vs. TD
• Incremental every-visit Monte-Carlo: update V(St) toward the actual return Gt
    V(St) ← V(St) + α(Gt − V(St))
    V(exit highway) ← 15 + α·((43 − 20) − 15)
  Gt is only known once home is reached, so MC changes can only be made offline, after the episode ends.

• Simplest temporal-difference learning algorithm: TD
    V(St) ← V(St) + α(Rt+1 + γV(St+1) − V(St))
    TD target: Rt+1 + γV(St+1);  TD error: δt = Rt+1 + γV(St+1) − V(St)
    V(exit highway) ← 15 + α·((30 + 10) − 35)
  By contrast, in TD each estimate is updated toward the estimate that immediately follows it, so each error is proportional to the change of the prediction over time.
Advantages and Disadvantages of MC vs. TD

• TD can learn before knowing the final outcome


– TD can learn online after every step
– MC must wait until end of episode before return is known

• TD can learn without the final outcome


– TD can learn from incomplete sequences
– MC can only learn from complete sequences
– TD works in continuing (non-terminating) environments
– MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
• The return Gt = Rt+1 + γRt+2 + ... + γ^{T−1}RT is an unbiased estimate of Vπ(St)
• The true TD target Rt+1 + γVπ(St+1) is an unbiased estimate of Vπ(St)
• The TD target Rt+1 + γV(St+1) is a biased estimate of Vπ(St), because it uses the current estimate V(St+1) instead of Vπ(St+1)
• The TD target is of much lower variance than the return
  – The return depends on many random actions, transitions and rewards
  – The TD target depends on one random action, transition and reward
Advantages and Disadvantages of MC vs. TD (2)

• MC has high variance, zero bias


– Good convergence properties
– (even with function approximation)
– Not very sensitive to initial value
– Very simple to understand and use

• TD has low variance, some bias


– Usually more efficient than MC
– TD converges to Vπ(St)
• (but not always with function approximation)
– More sensitive to initial value than MC
Monte-Carlo Backup
    V(St) ← V(St) + α(Gt − V(St))

Temporal-Difference Backup
    V(St) ← V(St) + α(Rt+1 + γV(St+1) − V(St))

Dynamic Programming Backup
    V(St) ← E[Rt+1 + γV(St+1)]

Backup: updating the value of a state using values of future states

Content
• Reinforcement Learning
– The model-based methods
• Markov Decision Process
• Planning by Dynamic Programming
– The model-free methods
• Model-free Prediction
– Monte-Carlo and Temporal Difference
• Model-free Control
– On-policy SARSA and off-policy Q-learning
– Convergence Analysis
• Deep Reinforcement Learning Examples
– Atari games
– Alpha Go
Uses of Model-Free Control
• Some example problems that can be modeled as MDPs
• Elevator • Robocup soccer
• Parallel parking • Atari & StarCraft
• Ship steering • Portfolio management
• Bioreactor • Protein folding
• Helicopter • Robot walking
• Aeroplane logistics • Game of Go

• For most real-world problems, either:


• MDP model is unknown, but experience can be sampled
• MDP model is known, but is too big to use, except by samples
• Model-free control can solve these problems
On- and Off-Policy Learning
• Two categories of model-free RL

• On-policy learning
– “Learn on the job”
– Learn about policy π from experience sampled from π

• Off-policy learning
– “Look over someone’s shoulder”
– Learn about policy π from experience sampled from
another policy μ
State Value and Action Value

    Gt = Rt+1 + γRt+2 + ... + γ^{T−1}RT

• State value
  – The state-value function Vπ(s) of an MDP is the expected return starting from state s and then following policy π
    Vπ(s) = Eπ[Gt | St = s]

• Action value
  – The action-value function Qπ(s,a) of an MDP is the expected return starting from state s, taking action a, and then following policy π
    Qπ(s,a) = Eπ[Gt | St = s, At = a]
Bellman Expectation Equation
• The state-value function Vπ(s) can be decomposed into the immediate reward plus the discounted value of the successor state
    Vπ(s) = Eπ[Rt+1 + γVπ(St+1) | St = s]

• The action-value function Qπ(s,a) can similarly be decomposed
    Qπ(s,a) = Eπ[Rt+1 + γQπ(St+1, At+1) | St = s, At = a]
State Value and Action Value

    Vπ(s) = Σ_{a∈A} π(a|s) Qπ(s,a)

    Qπ(s,a) = R(s,a) + γ Σ_{s'∈S} Psa(s') Vπ(s')
Model-Free Policy Iteration
• Given the state-value function V(s) and the action-value function Q(s,a), model-free policy iteration shall use the action-value function

• Greedy policy improvement over V(s) requires a model of the MDP
    π_new(s) = argmax_{a∈A} { R(s,a) + γ Σ_{s'∈S} Psa(s') Vπ(s') }
  We don't know the transition probability

• Greedy policy improvement over Q(s,a) is model-free
    π_new(s) = argmax_{a∈A} Q(s,a)
Generalized Policy Iteration with Action-Value Function

• Policy evaluation: Monte-Carlo policy evaluation, Q = Qπ


• Policy improvement: Greedy policy improvement?
Example of Greedy Action Selection

    Left door:            Right door:
    20%: Reward = 0       50%: Reward = 1
    80%: Reward = 5       50%: Reward = 3

• Given the two-door example above:
  – What if the first action is to choose the left door and we observe reward = 0?
  – The resulting greedy policy would be suboptimal if there is no exploration.

"Behind one door is tenure – behind the other is flipping burgers at McDonald's."
ɛ-Greedy Policy Exploration
• Simplest idea for ensuring continual exploration
• All m actions are tried with non-zero probability
  – With probability 1−ɛ, choose the greedy action
  – With probability ɛ, choose an action at random

    π(a|s) = ɛ/m + 1 − ɛ    if a = a* = argmax_{a'∈A} Q(s,a')
             ɛ/m            otherwise
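A minimal, hypothetical sketch of this selection rule (not from the lecture); Q is assumed to be a dict mapping (state, action) pairs to values and actions is the list of the m legal actions.

import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With prob. epsilon pick a uniformly random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

Note that the greedy action can also be drawn in the random branch, which gives exactly the probabilities ɛ/m + 1 − ɛ and ɛ/m stated above.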
ɛ-Greedy Policy Improvement
• Theorem
  – For any ɛ-greedy policy π, the ɛ-greedy policy π' w.r.t. Qπ is an improvement, i.e. Vπ'(s) ≥ Vπ(s)

    Qπ(s, π'(s)) = Σ_{a∈A} π'(a|s) Qπ(s,a)
                 = (ɛ/m) Σ_{a∈A} Qπ(s,a) + (1 − ɛ) max_{a∈A} Qπ(s,a)                         (m actions)
                 ≥ (ɛ/m) Σ_{a∈A} Qπ(s,a) + (1 − ɛ) Σ_{a∈A} [(π(a|s) − ɛ/m)/(1 − ɛ)] Qπ(s,a)
                 = Σ_{a∈A} π(a|s) Qπ(s,a) = Vπ(s)

  (The inequality holds because the maximum is at least any weighted average: for an ɛ-greedy π the weights (π(a|s) − ɛ/m)/(1 − ɛ) are non-negative and sum to 1.)
Generalized Policy Iteration with Action-Value Function
• One maintains both an
approximate policy and an
approximate value
function.
• Value function updated
for the current policy,
while the policy is
improved with respect to
the current value function
• However, this assumes that policy evaluation operates on an infinite number of episodes
• Policy evaluation: Monte-Carlo policy evaluation, Q = Qπ
• Policy improvement: ɛ-greedy policy improvement
Monte-Carlo Control

Every episode:
• Policy evaluation: Monte-Carlo policy evaluation, Q ≈ Qπ
• Policy improvement: ɛ-greedy policy improvement
MC Control vs. TD Control
• Temporal-difference (TD) learning has several advantages over
Monte-Carlo (MC)
– Lower variance
– Online
– Incomplete sequences

• Natural idea: use TD instead of MC in our control loop


– Apply TD to update action value Q(s,a)
– Use ɛ-greedy policy improvement
– Update the action value function every time-step
SARSA (On-Policy TD Control)
• For each State-Action-Reward-State-Action tuple generated by the current policy:
    At state s, take action a
    Observe reward r
    Transit to the next state s'
    At state s', take action a'

• Updating action-value functions with SARSA:
    Q(s,a) ← Q(s,a) + α(r + γQ(s',a') − Q(s,a))
On-Policy Control with SARSA
Every time-step, with an update for a single state-action pair:
• Policy evaluation: SARSA, Q ≈ Qπ
• Policy improvement: ɛ-greedy policy improvement
SARSA Algorithm

• NOTE: on-policy TD control samples actions using the current policy, i.e., the two 'A's in SARSA are both chosen by the current policy
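The full control loop can be summarised in a short, hypothetical Python sketch (not the lecture's code). It assumes an environment object that provides reset(), returning a state, and step(action), returning (next_state, reward, done); both actions are drawn from the same ɛ-greedy policy, which is what makes it on-policy.

import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control (SARSA) over a tabular Q[(state, action)]."""
    Q = defaultdict(float)

    def policy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q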
SARSA Example: Windy Gridworld
Standard gridworld with start and goal states, but with a crosswind blowing upward through the middle of the grid (the agent may be shifted upward by an amount it does not know in advance; the per-column wind strengths are given by the numbers below the grid in the original figure).

• Reward = -1 per time-step until reaching goal


• Undiscounted
Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An
introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998.
SARSA Example: Windy Gridworld
An optimal trajectory (shown in the original figure).

With exploration, the average episode length is about 17 steps (15 is the minimum).

Note: as the training proceeds, the Sarsa policy achieves the goal more and
more quickly
Off-Policy Learning
• Evaluate the target policy π(a|s) to compute Vπ(s) or Qπ(s,a)
• While following the behavior policy μ(a|s) (used to simulate trajectories)
    {s1, a1, r2, s2, a2, ..., sT} ~ μ

• Why is off-policy learning important?
  – Learn from observing humans or other agents
  – Re-use experience generated from old policies
  – Learn about the optimal policy while following an exploratory policy
  – Learn about multiple policies while following one policy
• My research in MSR Cambridge
Importance Sampling
• Estimate the expectation under a different distribution

    E_{x~p}[f(x)] = ∫ p(x) f(x) dx
                  = ∫ q(x) (p(x)/q(x)) f(x) dx
                  = E_{x~q}[(p(x)/q(x)) f(x)]

• Re-weight each instance by β(x) = p(x)/q(x)
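A minimal, hypothetical numerical sketch of this re-weighting (not from the lecture): it estimates E_p[f(x)] from samples drawn under q, for two example Gaussian densities chosen here purely for illustration.

import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_sampling_estimate(f, n=100_000):
    """Estimate E_p[f(x)] with p = N(1, 1), while sampling from q = N(0, 2)."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(0.0, 2.0)                                 # sample from proposal q
        beta = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # importance weight p/q
        total += beta * f(x)
    return total / n

# Example: E_p[x] should come out close to 1 for p = N(1, 1).
print(importance_sampling_estimate(lambda x: x))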
Importance Sampling for Off-Policy Monte-Carlo
• Use returns generated from μ to evaluate π
• Weight the return Gt according to the importance ratio between the policies
• Multiply the importance ratio along the whole episode
    {s1, a1, r2, s2, a2, ..., sT} ~ μ

    Gt^{π/μ} = [π(at|st)/μ(at|st)] · [π(at+1|st+1)/μ(at+1|st+1)] ··· [π(aT|sT)/μ(aT|sT)] · Gt

• Update the value towards the corrected return
    V(st) ← V(st) + α(Gt^{π/μ} − V(st))

• Cannot be used if μ is zero where π is non-zero
• Importance sampling can dramatically increase variance
Importance Sampling for Off-Policy TD
• Use TD targets generated from μ to evaluate π
• Weight the TD target r + γV(s') by importance sampling
• Only a single importance sampling correction is needed

    V(st) ← V(st) + α( [π(at|st)/μ(at|st)] (rt+1 + γV(st+1)) − V(st) )
            (importance sampling correction applied to the TD target)

• Much lower variance than Monte-Carlo importance sampling
• The policies only need to be similar over a single step
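A minimal, hypothetical sketch of the single-step correction above (not from the lecture); pi_prob and mu_prob are assumed helpers giving the target- and behavior-policy probabilities of the action actually taken.

def off_policy_td0_update(V, s, a, r, s_next, pi_prob, mu_prob,
                          alpha=0.1, gamma=1.0, done=False):
    """One off-policy TD(0) update: weight the TD target by pi(a|s)/mu(a|s)."""
    rho = pi_prob(s, a) / mu_prob(s, a)       # single-step importance ratio
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (rho * target - V[s])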
Q-Learning
• For off-policy learning of the action-value Q(s,a)
• No importance sampling is required (why?)
• The next action is chosen using the behavior policy: a_{t+1} ~ μ(·|s_t)
• But we consider an alternative successor action: a' ~ π(·|s_t)
• And update Q(s_t, a_t) towards the value of the alternative action:

    Q(s_t, a_t) ← Q(s_t, a_t) + α(r_{t+1} + γQ(s_{t+1}, a') − Q(s_t, a_t))
                                           (the action a' is from π, not μ)
Off-Policy Control with Q-Learning
• Allow both the behavior and target policies to improve
• The target policy π is greedy w.r.t. Q(s,a)
    π(s_{t+1}) = argmax_{a'} Q(s_{t+1}, a')
• The behavior policy μ is e.g. an ɛ-greedy policy w.r.t. Q(s,a)
• The Q-learning target then simplifies:
    r_{t+1} + γQ(s_{t+1}, a') = r_{t+1} + γQ(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'))
                              = r_{t+1} + γ max_{a'} Q(s_{t+1}, a')
• Q-learning update:
    Q(s_t, a_t) ← Q(s_t, a_t) + α(r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t))
Q-Learning Control Algorithm
    At state s, take action a
    Observe reward r
    Transit to the next state s'
    At state s', take action argmax_{a'} Q(s',a')

    Q(s_t, a_t) ← Q(s_t, a_t) + α(r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t))

• Theorem: Q-learning control converges to the optimal action-value function,
    Q(s,a) → Q*(s,a)

Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine Learning 8.3-4 (1992): 279-292.
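A minimal, hypothetical tabular Q-learning sketch of the update above (not the lecture's code). It assumes an environment object with reset() returning a state and step(action) returning (next_state, reward, done), and uses an ɛ-greedy behavior policy while bootstrapping from the greedy target policy.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, bootstrap from max_a' Q(s', a')."""
    Q = defaultdict(float)  # Q[(state, action)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy mu: epsilon-greedy w.r.t. the current Q.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Target policy pi: greedy w.r.t. Q -- hence the max in the target.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q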
Q-Learning Control Algorithm
    At state s, take action a
    Observe reward r
    Transit to the next state s'
    At state s', take action argmax_{a'} Q(s',a')

    Q(s_t, a_t) ← Q(s_t, a_t) + α(r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t))

• Why is Q-learning an off-policy control method?
  – It learns from (s, a, r, s') tuples generated by another policy μ
  – The first action a and the corresponding reward r are from μ
  – The next action a' is picked by the target policy π(s_{t+1}) = argmax_{a'} Q(s_{t+1}, a')

• Why no importance sampling?
  – Because we learn the action-value function, not the state-value function (the next action is chosen explicitly by the target policy via the max, so there is no action choice of μ that needs re-weighting)
SARSA vs. Q-Learning Experiments
• Cliff-walking
  – Undiscounted reward
  – Episodic task
  – Reward = −1 on all transitions
  – Stepping into the cliff area incurs −100 reward and sends the agent back to the start
• Why do the results look like this? (ɛ-greedy policy with ɛ = 0.1)

SARSA:      Q(s,a) ← Q(s,a) + α(r + γQ(s',a') − Q(s,a))
Q-learning: Q(s_t, a_t) ← Q(s_t, a_t) + α(r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t))
Step-by-step example of Q-learning
• Agents are expected to find a route to the outside of the building. Once the agent reaches the outside, i.e. room No. 5, it receives reward 100.
• Each room, including the outside, is a state, and the agent's move from one room to another is an action.

http://mnemstudio.org/path-finding-q-learning-tutorial.htm
Step-by-step example of Q-learning (2)
• The reward table R is then as shown in the original figure (rows are states, columns are actions), with reward 100 for moves that reach the outside (state 5).
• We use −1 in the table to represent no connection between nodes.
Step-by-step example of Q-learning (3)
• The agent can learn through experience. The agent can pass from one room to another but has no knowledge of the environment (model), and doesn't know which sequence of doors leads to the outside.
• The Q table is initialised to zero:

    Q0 = [ 0 0 0 0 0 0
           0 0 0 0 0 0
           0 0 0 0 0 0
           0 0 0 0 0 0
           0 0 0 0 0 0
           0 0 0 0 0 0 ]
Step-by-step example of Q-learning (4)
• Set γ = 0.8 and the initial state to Room 1.
  – Look at the second row (state 1) of matrix R.
  – There are two possible actions for the current state 1: go to state 3, or go to state 5.
  – By random selection, we select going to 5 as our action.
  – Now let's imagine what would happen if our agent were in state 5. Look at the sixth row of the reward matrix R (i.e. state 5). It has 3 possible actions: go to state 1, 4 or 5.

    Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
    Q(1,5) = R(1,5) + 0.8 · max[Q(5,1), Q(5,4), Q(5,5)] = 100 + 0.8 · 0 = 100

Step-by-step example of Q-learning (5)
• The next state, 5, now becomes the current state. Because 5 is the goal state, we've finished one episode.
• Our agent's brain now contains an updated matrix Q:

    Q1 = [ 0 0 0 0 0   0
           0 0 0 0 0 100
           0 0 0 0 0   0
           0 0 0 0 0   0
           0 0 0 0 0   0
           0 0 0 0 0   0 ]

    Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
    Q(1,5) = R(1,5) + 0.8 · max[Q(5,1), Q(5,4), Q(5,5)] = 100 + 0.8 · 0 = 100
Step-by-step example of Q-learning (continued)
• For the next episode, we start with a randomly chosen initial state, say state 3.
  – Look at the fourth row of matrix R; it has 3 possible actions: go to state 1, 2 or 4.
  – By random selection, we select going to state 1 as our action.
  – Now we imagine that we are in state 1. Look at the second row of the reward matrix R (i.e. state 1). It has 2 possible actions: go to state 3 or state 5.
  – Then, we compute the Q value:
    Q(3,1) = R(3,1) + 0.8 · max[Q(1,3), Q(1,5)] = 0 + 0.8 · max(0, 100) = 80
• The next state, 1, now becomes the current state. We repeat the inner loop of the Q-learning algorithm because state 1 is not the goal state. Our agent's brain now contains the updated matrix Q:

    Q2 = [ 0  0 0 0 0   0
           0  0 0 0 0 100
           0  0 0 0 0   0
           0 80 0 0 0   0
           0  0 0 0 0   0
           0  0 0 0 0   0 ]
Step-by-step example of Q-learning (continued)
• So, starting the new inner loop with the current state 1:
  – There are two possible actions: go to state 3, or go to state 5. By lucky draw, our selected action is 5.
  – Now, imagining we're in state 5, there are three possible actions: go to state 1, 4 or 5.
  – We compute the Q value using the maximum value of these possible actions:
    Q(1,5) = R(1,5) + 0.8 · max[Q(5,1), Q(5,4), Q(5,5)] = 100 + 0.8 · 0 = 100
• The entries Q(5,1), Q(5,4) and Q(5,5) of matrix Q are all zero. The result of this computation for Q(1,5) is 100 because of the instant reward R(1,5), so this result does not change the Q matrix.
• Because 5 is the goal state, we finish this episode. Our agent now has the updated matrix Q:

    Q2 = [ 0  0 0 0 0   0
           0  0 0 0 0 100
           0  0 0 0 0   0
           0 80 0 0 0   0
           0  0 0 0 0   0
           0  0 0 0 0   0 ]
Step-by-step example of Q-learning (9)
• If our agent learns more through further episodes, it will finally reach convergence values in matrix Q.
• Once the matrix Q gets close enough to a state of convergence, we know our agent has learned the most optimal paths to the goal state.
• Tracing the best sequence of states is then as simple as following the links with the highest values at each state.
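Putting the walkthrough above into code, here is a minimal, hypothetical sketch (not from the lecture). The reward matrix R below is an assumption reconstructed from the mnemstudio tutorial's room layout (−1 means no door, 100 for any move into the outside, state 5); only the transitions mentioned in the walkthrough are guaranteed to match it.

import random

# Assumed reward matrix from the tutorial's room layout: R[s][a] = -1 means no door.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
GAMMA, GOAL = 0.8, 5
Q = [[0.0] * 6 for _ in range(6)]

for _ in range(1000):                        # episodes
    s = random.randrange(6)                  # random initial state
    while s != GOAL:
        a = random.choice([j for j in range(6) if R[s][j] >= 0])   # random valid move
        # Q(s,a) = R(s,a) + gamma * max_a' Q(s',a'); here the next state s' equals a.
        Q[s][a] = R[s][a] + GAMMA * max(Q[a])
        s = a

# Trace the greedy path from room 2 to the outside by following the largest Q values.
s, path = 2, [2]
while s != GOAL:
    s = max(range(6), key=lambda j: Q[s][j])
    path.append(s)
print(path)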
Relationship Between DP and TD

                                             Full Backup (DP)              Sample Backup (TD)
Bellman Expectation Equation for Vπ(s)       Iterative Policy Evaluation   TD Learning
Bellman Expectation Equation for Qπ(s,a)     Q-Policy Iteration            SARSA
Bellman Optimality Equation for Q*(s,a)      Q-Value Iteration             Q-Learning

Relationship Between DP and TD (2)

Full Backup (DP)                                  Sample Backup (TD)
Iterative Policy Evaluation                       TD Learning
    V(s) ← E[r + γV(s') | s]                          V(s) ←α r + γV(s')
Q-Policy Iteration                                SARSA
    Q(s,a) ← E[r + γQ(s',a') | s,a]                   Q(s,a) ←α r + γQ(s',a')
Q-Value Iteration                                 Q-Learning
    Q(s,a) ← E[r + γ max_{a'} Q(s',a') | s,a]         Q(s,a) ←α r + γ max_{a'} Q(s',a')

where x ←α y means x ← x + α(y − x)
Convergence Proof of Q-learning
• Recall that the optimal Q-function (S×A → R) is a fixed point of a contraction operator T, defined for a generic function q: S×A → R as

    (T q)(s,a) ≜ r(s,a) + γ Σ_{s'∈S} P(s'|s,a) max_{a'∈A} q(s',a')

• Let {s_t} be a sequence of states obtained by following policy π, {a_t} the sequence of corresponding actions and {r_t} the sequence of obtained rewards. Then, given any initial estimate Q_0, Q-learning uses the following update rule:

    Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + α_t(s_t,a_t) [ r_t + γ max_{b∈A} Q_t(s_{t+1},b) − Q_t(s_t,a_t) ]

Convergence Proof of Q-learning (2)
• Theorem (Main theorem): Given a finite MDP, the Q-learning algorithm given by the above update rule converges w.p.1 to the optimal Q-function as long as

    Σ_t α_t(s,a) = ∞   and   Σ_t α_t²(s,a) < ∞

  for all (s,a) ∈ S×A.
  – At the (t+1)-th update, only the component (s_t, a_t) is updated; the condition on the step-sizes α_t(s,a) therefore requires that all state-action pairs be visited infinitely often. (For example, α_t(s,a) = 1/N_t(s,a), with N_t(s,a) the number of visits to (s,a) so far, satisfies both conditions provided every pair is visited infinitely often.)

Convergence Proof of Q-learning (3)
• Theorem (Stochastic approximation): The random process {Δ_t} taking values in R^n and defined as

    Δ_{t+1}(x) = (1 − α_t(x)) Δ_t(x) + α_t(x) F_t(x)

  converges to zero w.p.1 under the following assumptions:
  1. 0 ≤ α_t ≤ 1, Σ_t α_t(x) = ∞ and Σ_t α_t²(x) < ∞
  2. ||E[F_t(x) | F_t]||_W ≤ κ ||Δ_t||_W, with κ < 1   (F_t denotes all the past information at time t)
  3. var[F_t(x) | F_t] ≤ C (1 + ||Δ_t||_W²), for some C > 0

Jaakkola, Tommi, Michael I. Jordan, and Satinder P. Singh. "On the convergence of stochastic iterative dynamic programming algorithms." Neural Computation 6(6): 1185-1201, 1994.

Convergence Proof of Q-learning (4)
• We want to use the stochastic approximation theorem to prove the convergence of Q-learning.
  – Start by rewriting the Q-learning update rule as

      Q_{t+1}(s_t,a_t) = (1 − α_t(s_t,a_t)) Q_t(s_t,a_t) + α_t(s_t,a_t) [ r_t + γ max_{b∈A} Q_t(s_{t+1},b) ]

  – Subtracting the quantity Q*(s_t,a_t) from both sides and letting

      Δ_t(s,a) = Q_t(s,a) − Q*(s,a)

    yields

      Δ_{t+1}(s_t,a_t) = (1 − α_t(s_t,a_t)) Δ_t(s_t,a_t) + α_t(s_t,a_t) [ r_t + γ max_{b∈A} Q_t(s_{t+1},b) − Q*(s_t,a_t) ]

Convergence Proof of Q-learning (5)
• If we write

      F_t(s,a) = r(s,a,S(s,a)) + γ max_{b∈A} Q_t(S(s,a), b) − Q*(s,a)

  – where S(s,a) is a random sample state obtained from the Markov chain (S, P_a), then from the transition model we have

      E[F_t(s,a) | F_t] = Σ_{y∈S} P_a(s,y) [ r(s,a,y) + γ max_{b∈A} Q_t(y,b) − Q*(s,a) ]
                        = (T Q_t)(s,a) − Q*(s,a)

  – Using the fact that Q* = T Q*,

      E[F_t(s,a) | F_t] = (T Q_t)(s,a) − (T Q*)(s,a)

Convergence Proof of Q-learning (6)
• Recall that the operator T is a contraction in the sup-norm, i.e.

      ||T Q_1 − T Q_2||_∞ ≤ γ ||Q_1 − Q_2||_∞,   and iterating, ||T^(k) Q_0 − Q*||_∞ ≤ γ^k ||Q_0 − Q*||_∞

• Therefore

      ||E[F_t(s,a) | F_t]||_∞ = ||T Q_t − T Q*||_∞ ≤ γ ||Q_t − Q*||_∞ = γ ||Δ_t||_∞

  so condition 2 is met (with κ = γ < 1).

Convergence Proof of Q-learning (7)
• Finally, for condition 3:

      var[F_t(s,a) | F_t] = E[ ( r(s,a,S(s,a)) + γ max_{b∈A} Q_t(S(s,a),b) − Q*(s,a) − (T Q_t)(s,a) + Q*(s,a) )² ]
                          = E[ ( r(s,a,S(s,a)) + γ max_{b∈A} Q_t(S(s,a),b) − (T Q_t)(s,a) )² ]     (the Q* terms cancel)
                          = var[ r(s,a,S(s,a)) + γ max_{b∈A} Q_t(S(s,a),b) | F_t ]                 (by definition of variance)

  which, because r is bounded and the remaining term depends on Q_t at most linearly, clearly verifies

      var[F_t(s,a) | F_t] ≤ C (1 + ||Δ_t||_W²)

  for some constant C.

• Meeting all 3 conditions, then, by the stochastic approximation theorem, Δ_t converges to 0 w.p.1, i.e. Q_t converges to Q* w.p.1. ∎

Melo, Francisco S. "Convergence of Q-learning: A simple proof." Institute of Systems and Robotics, Tech. Rep. (2001): 1-4.
Content
• Reinforcement Learning
– The model-based methods
• Markov Decision Process
• Planning by Dynamic Programming
• Convergence Analysis
– The model-free methods
• Model-free Prediction
– Monte-Carlo and Temporal Difference
• Model-free Control
– On-policy SARSA and off-policy Q-learning
– Convergence Analysis
• Deep Reinforcement Learning Examples
– Atari games
– Alpha Go
Reinforcement learning
• Computerised agent: Learning what to do
– How to map situations (states) to actions so as to
maximise a numerical reward signal

Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
Reinforcement learning
• Decisions {state s, action a, reward r} are made in stages; the outcome of each decision is not fully predictable, but it can be observed before the next decision is made
• The objective is
– to maximize a numerical measure of total reward

• Decisions cannot be viewed in isolation:


– need to balance desire for
immediate reward with possibility
of high future reward

Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
Q-learning
• For a game, the future discounted reward (return) at time t is
    Rt = rt + γ rt+1 + γ² rt+2 + ..., where γ is the discount factor (which can be read as the probability of continuing rather than terminating at each step)
• We define the optimal action-value function
    Q*(s,a) = max_π E[ rt + γ rt+1 + γ² rt+2 + ... | st = s, at = a, π ], where π = p(a|s) is a policy
• Bellman equation:
    Q*(s,a) = E_{s'~ε}[ r + γ max_{a'} Q*(s',a') | st = s, at = a ],
    where ε is the sample distribution from an emulator
• So we iteratively update Q* as above, and act by
    a_t* = argmax_a Q*(s_t, a)

Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
A simplified Q-learning algorithm

Maintain a Q table
and keep updating it

http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
Deep Q-learning
• However, in many cases tabular Q-learning is not practical, because
  – the action × state space is very large
  – it lacks generalisation
• The Q function is therefore commonly approximated by a parameterised function, either linear or non-linear:
    Q(s, a; θ) ≈ Q*(s, a), where θ is the model parameter
• It can be trained by minimising a sequence of loss functions

    L_i(θ_i) = E_{(s,a,r,s')~U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_i^−) − Q(s, a; θ_i) )² ]

  where U(D) is the sample distribution over a pool of stored samples D (experience replay) and the target parameters θ_i^− are only updated every C steps.

From the paper: "We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution. Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target."

Mnih V, Kavukcuoglu K, Silver D, et al. "Human-level control through deep reinforcement learning." Nature, 2015, 518(7540): 529-533.
Deep Q-learning

Algorithm 1: Deep Q-learning with Experience Replay
    Initialize replay memory D to capacity N
    Initialize action-value function Q with random weights
    for episode = 1, M do
        Initialise sequence s1 = {x1} and preprocessed sequence φ1 = φ(s1)
        for t = 1, T do
            With probability ε select a random action a_t
            otherwise select a_t = argmax_a Q*(φ(s_t), a; θ)
            Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
            Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
            Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
            Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
            Set y_j = r_j                                      if φ_{j+1} is terminal
                      r_j + γ max_{a'} Q(φ_{j+1}, a'; θ)       if φ_{j+1} is non-terminal
            Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))²
        end for
    end for

Note from the paper: learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples via experience replay breaks these correlations and therefore reduces the variance of the updates.

Mnih V, Kavukcuoglu K, Silver D, et al. "Human-level control through deep reinforcement learning." Nature, 2015, 518(7540): 529-533.
Mnih V, Kavukcuoglu K, Silver D, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602, 2013.
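To illustrate the two ideas above without a deep network, here is a minimal, hypothetical sketch (not the paper's code) of Q-learning with a linear function approximator, a small replay buffer, and periodically copied target parameters. It assumes an environment with reset()/step(action) returning (next_state, reward, done) and a featurize(s) helper mapping a state to a feature vector of dimension dim.

import random
import numpy as np

def train_linear_dqn(env, n_actions, featurize, dim,
                     episodes=200, gamma=0.99, lr=0.01, epsilon=0.1,
                     buffer_size=10_000, batch_size=32, target_update=100):
    """Q(s,a;theta) = theta[a] . featurize(s); experience replay + target parameters."""
    theta = np.zeros((n_actions, dim))          # online parameters
    theta_target = theta.copy()                 # target parameters, copied every C steps
    replay, step = [], 0

    def q_values(params, s):
        return params @ featurize(s)            # vector of Q(s, a) for all actions

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.randrange(n_actions) if random.random() < epsilon
                 else int(np.argmax(q_values(theta, s))))
            s_next, r, done = env.step(a)
            replay.append((s, a, r, s_next, done))
            if len(replay) > buffer_size:
                replay.pop(0)
            if len(replay) >= batch_size:
                for (sj, aj, rj, sj1, dj) in random.sample(replay, batch_size):
                    y = rj if dj else rj + gamma * np.max(q_values(theta_target, sj1))
                    x = featurize(sj)
                    # Semi-gradient step on the squared error (y - Q(sj, aj; theta))^2
                    theta[aj] += lr * (y - theta[aj] @ x) * x
            step += 1
            if step % target_update == 0:
                theta_target = theta.copy()
            s = s_next
    return theta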
DQN Results on Atari Games
The performance of DQN is normalized with respect to a professional human games tester.

Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
Content
• Reinforcement Learning
– The model-based methods
• Markov Decision Process
• Planning by Dynamic Programming
– The model-free methods
• Model-free Prediction
– Monte-Carlo and Temporal Difference
• Model-free Control
– On-policy SARSA and off-policy Q-learning
• Deep Reinforcement Learning Examples
– Atari games
– Alpha Go
The Go Game
• Go (Weiqi) is an ancient game originating from China, with a history of over 3000 years
  – The board: 19 × 19 lines, making 361 intersections ('points')
  – The move: starting from Black, stones are placed on points in alternation until the end
  – The goal: control the maximum territory
The Go Game: Liberties
• Each player places a stone on an unoccupied intersection on their turn; once placed, stones are not moved to another location.
• Liberties: the unoccupied intersections (points) that are horizontally or vertically adjacent to a stone (diagonally adjacent points are not liberties). A stone in the middle has four liberties, a stone at the side has three, and a stone in the corner has two.
• A chain consists of two or more stones connected horizontally or vertically (not diagonally); the liberties of a chain are counted together as a unit.
• Stones without liberties must be removed from the board (thus captured), e.g. Black captures 3 stones from White when they are left with no liberties.
One of the great challenges of AI
• Go is far more complex than chess
  – Chess has roughly 35^80 possible move sequences
  – Go has roughly 250^150 possible move sequences
• Computers match or surpass top humans in chess, backgammon, poker, even Jeopardy
• But not the game of Go – until AlphaGo in 2016
Diagram 1-5 Diagram 1-5

• Yet another example of deep reinforcement


A chain consists of two or more stones that are connected toA each chainother
consists of two or more stones that are connected to ea
horizontally or vertically, but not diagonally. The liberties of horizontally
a chain are or vertically, but not diagonally. The liberties of a c
learning
counted together as a unit. An example is Diagram 1-5, where
stones have a combined total of six liberties marked X. When
counted
the two
stones
together
whitehave
black as a unit. An example is Diagram 1-5, where
has a combined total of six liberties marked X. When w
played at all the positions marked X, such that the two blackplayedstonesathave
all the
no positions marked X, such that the two black st
liberties at
liberties at all, then white will remove the two stones together. At no time isall, then white will remove the two stones together.
white allowed to remove any of the two stones individually. whiteAs theallowed
saying to remove any of the two stones individually. As
goes, "One for all, all for one". goes, "One for all, all for one".

Reward on final state (win/lose)

Setting:
• State space S
• Action space A
• Known transition model: p(s, a, s′)
• Reward on final states: win or lose

Baseline strategies do not apply:
• Cannot grow the full tree
• Cannot safely cut branches

Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
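The setting above can be captured by a small game interface; this is an illustrative sketch (the names and types are assumptions, not from the lecture), deterministic as in Go, which the later search sketches build on:

```python
from typing import Hashable, Protocol, Sequence

State = Hashable
Action = Hashable

class TwoPlayerGame(Protocol):
    """Minimal interface for the setting: states S, actions A, a known
    transition model, and a win/lose reward only on final states."""
    def legal_actions(self, s: State) -> Sequence[Action]: ...
    def next_state(self, s: State, a: Action) -> State: ...   # deterministic p(s, a, s') for Go
    def is_terminal(self, s: State) -> bool: ...
    def terminal_reward(self, s: State) -> float: ...         # +1 win, -1 lose (player 1's view)
```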


Minimax search
• Tic-tac-toe zero-sum two-player game
The minimax algorithm is defined recursively as

    ai* = argmax_{ai ∈ Nt} ( min_{a−i ∈ Nt+1} u(a−i, ai) )

ai : action taken by you (player i) at node Nt
a−i : action taken by the opponent at node Nt+1
where the set Nt+1 contains the successors of node Nt. If Nt is a leaf
(terminal) node, the recursion returns the value of that node (for you).

Neumann, L. J., & Morgenstern, O. (1947). Theory of games and economic behavior (Vol. 60). Princeton: Princeton University Press.
Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs (1995).
Minimax search
• Tic-tac-toe zero-sum two-player game
The minimax algorithm is defined recursively as

    ai* = argmax_{ai ∈ Nt} ( min_{a−i ∈ Nt+1} u(a−i, ai) )

ai : action taken by you (player i) at node Nt
a−i : action taken by the opponent at node Nt+1
where the set Nt+1 contains the successors of node Nt. If Nt is a leaf
(terminal) node, the recursion returns the value of that node (for you).

For Tic-tac-toe the number of terminal nodes is 9! = 362,880, but it is
about 10^40 for chess and 10^170 for the game of Go!
Three ways to make the search tractable, by applying prior knowledge or sampling:
a) look only at a certain number of moves ahead (plies) and apply a heuristic
   evaluation function u, e.g. IBM Deep Blue
b) reduce Nt, the set of possible moves considered at each round, i.e., learn
   p(action|state)
c) random sampling by Monte-Carlo simulation

Neumann, L. J., & Morgenstern, O. (1947). Theory of games and economic behavior (Vol. 60). Princeton: Princeton University Press.
Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs (1995).
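A minimal sketch of depth-limited minimax (solution (a) above), written against the game interface sketched earlier; `heuristic_value` is an assumed evaluation function u, not something defined in the lecture:

```python
def minimax(game, s, depth, maximizing=True):
    """Depth-limited minimax: recurse over successors, cut off at `depth`
    and fall back to a heuristic evaluation u(s)."""
    if game.is_terminal(s):
        return game.terminal_reward(s)
    if depth == 0:
        return game.heuristic_value(s)        # assumed heuristic u(s), e.g. material count
    values = [minimax(game, game.next_state(s, a), depth - 1, not maximizing)
              for a in game.legal_actions(s)]
    return max(values) if maximizing else min(values)

def best_move(game, s, depth=4):
    # one ply of argmax on top of the min recursion, as in the slide's formula
    return max(game.legal_actions(s),
               key=lambda a: minimax(game, game.next_state(s, a), depth - 1, False))
```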
Monte-Carlo tree search (MCTS)
Each tree node stores the number of won playouts w(a) and played playouts n(a)

• Coined in 2006, MCTS expands the search tree by real-time simulation and is
  further extended with a bandit (UCB) solution for exploration:

      ai* = argmax_a ( w(a)/n(a) + sqrt( 2 ln(n) / n(a) ) )

  w(a) : num. wins after action a;  n : num. visits of the node;
  n(a) : num. visits for action a

• Start with the current state in the tree (e.g., the node 11/21) as the root
  and repeatedly run the four steps (selection, expansion, simulation,
  backpropagation) until timeout

• In 2008, MoGo achieved dan (master) level in 9×9 Go

• Then the move (child of the root node) with the most simulations made is
  typically selected (robust criterion), rather than the move with the highest
  average win rate (max criterion)
Browne, Cameron B., et al. "A survey of monte carlo tree search methods." Computational Intelligence and AI in Games, IEEE Transactions on 4.1 (2012): 1-43.
Kocsis, Levente, and Csaba Szepesvári. "Bandit based monte-carlo planning." Machine Learning: ECML 2006. Springer Berlin Heidelberg, 2006. 282-293.
Coulom, Rémi. "Efficient selectivity and backup operators in Monte-Carlo tree search." Computers and games. Springer Berlin Heidelberg, 2006. 72-83.
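A compact sketch of the four MCTS steps with UCB1 selection, again written against the assumed game interface from before (perspective handling between the two players is omitted to keep it short):

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.w, self.n = [], 0.0, 0   # won / played playouts

def ucb1(child, parent_n):
    # exploitation w(a)/n(a) + exploration sqrt(2 ln n / n(a))
    return child.w / child.n + math.sqrt(2 * math.log(parent_n) / child.n)

def mcts(game, root_state, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend while every child has been tried at least once
        while node.children and all(c.n > 0 for c in node.children):
            node = max(node.children, key=lambda c: ucb1(c, node.n))
        # 2. Expansion: add the children of a non-terminal leaf
        if not node.children and not game.is_terminal(node.state):
            node.children = [Node(game.next_state(node.state, a), node, a)
                             for a in game.legal_actions(node.state)]
        if node.children:
            unvisited = [c for c in node.children if c.n == 0]
            node = random.choice(unvisited or node.children)
        # 3. Simulation: random playout to the end of the game
        s = node.state
        while not game.is_terminal(s):
            s = game.next_state(s, random.choice(game.legal_actions(s)))
        z = game.terminal_reward(s)
        # 4. Backpropagation: update won/played counts up to the root
        while node is not None:
            node.n += 1
            node.w += z
            node = node.parent
    return max(root.children, key=lambda c: c.n).action   # robust criterion
```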
Final move selection
• the “best child” of the root: the move finally
played after the simulations
• different empirical ways to define which child is
the best
– Max child is the child that has the highest value.
– Robust child is the child with the highest visit count.
– Robust-max child is the child with both the highest visit
count and the highest value
• If there is no robust-max child, more simulations are played
until a robust-max child occurs

Results for the board game Thurn and Taxis (Schadd, 2009):

  Selection method   Max child          Robust child
  As player          1        2         1        2
  won  (%)           36.6     46.6      43.3     50.0
  draw (%)           5.6      6.6       0.0      0.0
  loss (%)           56.6     46.6      56.6     50.0

Schadd, Frederik Christiaan. "Monte-Carlo search techniques in the modern board game Thurn and Taxis." Maastricht University: Maastricht, The Netherlands (2009).
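Reusing the `Node` class from the MCTS sketch above, the three selection criteria can be written as a small helper (illustrative only):

```python
def best_child(root, criterion="robust"):
    """Pick the move to actually play once the simulations are done."""
    if criterion == "max":                    # highest average win rate
        return max(root.children, key=lambda c: c.w / max(c.n, 1))
    if criterion == "robust":                 # highest visit count
        return max(root.children, key=lambda c: c.n)
    if criterion == "robust-max":             # both at once, or None (keep simulating)
        by_value = max(root.children, key=lambda c: c.w / max(c.n, 1))
        by_visits = max(root.children, key=lambda c: c.n)
        return by_value if by_value is by_visits else None
    raise ValueError(f"unknown criterion: {criterion}")
```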
AlphaGo

• Employ two deep convolutional neural networks
  – Policy network p(a|s): predicts p(action = a | state = s), i.e., the human
    move in a given board position
  – Value network v(s′): predicts the expected terminal reward of a given
    state (that is, whether the current player wins), using positions from the
    self-play data set

(Figure: schematic representation of the neural network architecture used in
AlphaGo. The policy network takes a representation of the board position s as
its input, passes it through many convolutional layers with parameters σ (SL
policy network) or ρ (RL policy network), and outputs a probability
distribution pσ(a|s) or pρ(a|s) over legal moves a. The value network similarly
uses many convolutional layers with parameters θ, but outputs a scalar value
vθ(s′) that predicts the expected outcome in position s′.)

Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
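A minimal sketch of the two network heads in PyTorch; this illustrates the policy/value split only and is not the actual AlphaGo architecture (layer counts, feature planes and sizes are assumptions):

```python
import torch
import torch.nn as nn

BOARD = 19   # 19x19 Go board

class PolicyNet(nn.Module):
    """Maps a board representation s to log-probabilities over the 361 moves."""
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(channels, 1, 1)            # one logit per intersection

    def forward(self, s):                                 # s: [B, in_planes, 19, 19]
        logits = self.head(self.body(s)).flatten(1)       # [B, 361]
        return torch.log_softmax(logits, dim=-1)          # log p(a|s)

class ValueNet(nn.Module):
    """Maps a board representation s to a scalar value in [-1, 1]."""
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * BOARD * BOARD, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, s):
        return self.head(self.body(s))                    # v(s), shape [B, 1]
```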
AlphaGo

• Training the policy networks to predict human moves
  – Supervised learning (SL) from 30m positions from the KGS Go Server
  – (1) pπ(a|s): fast rollout prediction with a linear softmax
  – (2) pσ(a|s): more accurate prediction with a 13-layer CNN, supervised
    learning from human expert positions
  – (3) pρ(a|s): further tuned by policy gradient from the terminal reward of
    self-play

  Gradient update (policy gradient):

      Δρ ∝ ∂log pρ(at | st) / ∂ρ · zt,    zt : terminal reward

(Figure 1 | Neural network training pipeline: a fast rollout policy pπ and a
supervised learning (SL) policy network pσ are trained to predict human expert
moves in a data set of positions. A reinforcement learning (RL) policy network
pρ is initialized to the SL policy network, and is then improved by policy
gradient learning to maximize the outcome (that is, winning more games) against
previous versions of the policy network.)

Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
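A sketch of the policy-gradient (REINFORCE) step above, assuming a policy network like the `PolicyNet` sketched earlier that returns log-probabilities; the optimizer setup, batching and the self-play loop are omitted:

```python
import torch

def reinforce_update(policy_net, optimizer, states, actions, z):
    """One step of Δρ ∝ Σ_t ∂log p_ρ(a_t|s_t)/∂ρ · z_t (gradient ascent)."""
    log_probs = policy_net(states)                                   # [T, 361]
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)    # log p_ρ(a_t|s_t)
    loss = -(chosen * z).mean()      # minimise the negative reward-weighted log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# z is the terminal reward of the self-play game (+1 win / -1 loss),
# broadcast to every time-step of that game.
```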
AlphaGo

• Training the value network to predict the terminal reward
  – Self-play with the RL policy pρ is used to generate positions for training
    the value network
  – vθ(s): position evaluation, approximating

        vρ(s) = E[ z | st = s, at...T ~ ρ ],    z : terminal reward

  – Gradient update (regression towards the game outcome):

        Δθ ∝ (z − vθ(s)) ∂vθ(s) / ∂θ

(Figure 1 | Neural network training pipeline: a value network vθ is trained by
regression to predict the expected outcome (that is, whether the current player
wins) in positions from the self-play data set.)

Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
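The corresponding regression step, assuming a value network like the `ValueNet` sketch above; minimising the squared error 1/2 (z − vθ(s))^2 gives exactly the update Δθ ∝ (z − vθ(s)) ∂vθ(s)/∂θ:

```python
import torch

def value_update(value_net, optimizer, states, z):
    """One regression step of the value network on (state, outcome) pairs."""
    v = value_net(states).squeeze(-1)      # v_θ(s), shape [B]
    loss = 0.5 * (z - v).pow(2).mean()     # MSE; its gradient is -(z - v) ∂v/∂θ
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```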
AlphaGo: Monte Carlo tree search

(Figure 3 | Monte Carlo tree search in AlphaGo. a, Each simulation traverses
the tree by selecting the edge with maximum action value Q, plus a bonus u(P)
that depends on a stored prior probability P for that edge. b, The leaf node
may be expanded; the new node is processed once by the policy network pσ and
the output probabilities are stored as prior probabilities P for each action.
c, At the end of a simulation, the leaf node is evaluated in two ways: using
the value network vθ, and by running a rollout to the end of the game with the
fast rollout policy pπ, then computing the winner with function r. d, Action
values Q are updated to track the mean value of all evaluations r(·) and vθ(·)
in the subtree below that action.)
• Combining the two networks (p(a|s) and v(s)) together: each edge (s, a) of
  the search tree stores an action value Q(s, a), a visit count N(s, a), and a
  prior probability P(s, a)

• Selection: at each time step t of a simulation, descend the tree by

      at = argmax_a ( Q(st, a) + u(st, a) ),   with u(s, a) ∝ P(s, a) / (1 + N(s, a))

  so as to maximize the action value plus a bonus that is proportional to the
  prior probability but decays with repeated visits, to encourage exploration

• Expansion: when the traversal reaches a leaf node sL, the leaf may be
  expanded; it is processed just once by the SL policy network pσ and the
  output probabilities are stored as prior probabilities P(s, a) = pσ(a|s)

• Evaluation: the leaf node is evaluated in two very different ways: first, by
  the value network vθ(sL); and second, by the outcome zL of a random rollout
  played out until terminal step T using the fast rollout policy pπ; the two
  are combined with a mixing parameter λ into a leaf evaluation

      V(sL) = (1 − λ) vθ(sL) + λ zL

• Backup: at the end of a simulation, the action values and visit counts of
  all traversed edges are updated:

      N(s, a) = Σ_i 1(s, a, i)
      Q(s, a) = (1 / N(s, a)) Σ_i 1(s, a, i) V(sL^i)

  where sL^i is the leaf node from the i-th simulation and 1(s, a, i) indicates
  whether the edge (s, a) was traversed during the i-th simulation

• Once the search is complete, the algorithm chooses the most visited move
  from the root position (AlphaZero and, later, MuZero instead choose the move
  stochastically with a temperature parameter)

• It is worth noting that the SL policy network pσ performed better in AlphaGo
  than the stronger RL policy network pρ, presumably because humans select a
  diverse beam of promising moves, whereas RL optimizes for the single best
  move; however, the value function vθ(s) ≈ v^pρ(s) derived from the stronger
  RL policy network performed better

Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.

ework
AlphaZero: without human knowledge

• Self-play: each game is played until termination (or until the search value
  drops below a resignation threshold) and is then scored to give a final
  reward rT ∈ {−1, +1}. The data for each time-step t is stored as
  (st, πt, zt), where zt = ±rT is the game winner from the perspective of the
  current player at step t

• Add an additional neural network outputting policy p and value v by imitating
  MCTS: in parallel, new network parameters θi are trained from data (s, π, z)
  sampled uniformly among all time-steps of the last iteration(s) of self-play.
  The neural network (p, v) = fθi(s) is adjusted to minimise the error between
  the predicted value v and the self-play winner z, and to maximise the
  similarity of the network move probabilities p to the search probabilities π.
  Specifically, the parameters θ are adjusted by gradient descent on a loss
  function l that sums over mean-squared error and cross-entropy losses
  respectively:

      l = (z − v)^2 − π⊤ log p + c ||θ||^2                    (1)

  where c is a parameter controlling the level of L2 weight regularisation
  (to prevent overfitting)

Anthony, Thomas, Zheng Tian, and David Barber. "Thinking fast and slow with deep learning and tree search." arXiv preprint arXiv:1705.08439 (2017).
Silver, David, et al. "Mastering the game of go without human knowledge." Nature 550.7676 (2017): 354-359.

Empirical Analysis of AlphaGo Zero Training
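A sketch of the AlphaZero-style loss in equation (1), in PyTorch; `p_logits` are the network's move logits, `pi` the MCTS search probabilities, `z` the self-play outcome, and the L2 term is often handled by the optimizer's weight decay instead (all names are illustrative):

```python
import torch

def alphazero_loss(p_logits, v, pi, z, params, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2, averaged over the batch."""
    value_loss = (z - v.squeeze(-1)).pow(2).mean()                             # (z - v)^2
    policy_loss = -(pi * torch.log_softmax(p_logits, dim=-1)).sum(-1).mean()   # -pi^T log p
    l2 = c * sum(w.pow(2).sum() for w in params)                               # c * ||theta||^2
    return value_loss + policy_loss + l2
```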
MuZero: learning the transition

• Planning (A), acting via self-play (B), and training (C) with a learned
  model: build up a hidden state s^k to represent the state, use it to build a
  transition model (so that MCTS in (A) runs entirely in the learned latent
  space), and also make predictions about the value and optimise the policy

(Figure 1 | How MuZero uses its model to plan. The model consists of three
connected components for representation, dynamics and prediction. Given a
previous hidden state s^{k−1} and a candidate action a^k, the dynamics function
g produces an immediate reward r^k and a new hidden state s^k. The policy p^k
and value function v^k are computed from the hidden state s^k by a prediction
function f. The initial hidden state s^0 is obtained by passing the past
observations (e.g. the Go board or Atari screen) into a representation
function h. (B) To act in the environment, a Monte-Carlo tree search is
performed as in (A).)

Schrittwieser, Julian, et al. "Mastering atari, go, chess and shogi by planning with a learned model." Nature 588.7839 (2020): 604-609.
environment. A Monte-Carlo Tree Search is performed
References
• Reinforcement learning slides are largely based
on
– Sutton, Richard S., and Andrew G.
Barto. Reinforcement learning: An introduction. Vol. 1.
No. 1. Cambridge: MIT press, 1998.
– Weinan Zhang’s lecture notes at Shanghai Jiaotong
University
• Szepesvári, Csaba. "Algorithms for reinforcement
learning." Synthesis lectures on artificial intelligence
and machine learning 4.1 (2010): 1-103.
• Neto, Gonçalo. "From single-agent to multi-agent
reinforcement learning: Foundational concepts and
methods." Learning theory course (2005).
Deep Reinforcement Learning and Game

• Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction.
  MIT press, 1998.
• Neumann, L. J., & Morgenstern, O. (1947). Theory of games and economic
behavior (Vol. 60). Princeton: Princeton university press.
• Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep
reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
• Mnih V, Kavukcuoglu K, Silver D, et al. Playing atari with deep reinforcement
learning[J]. arXiv preprint arXiv:1312.5602, 2013.
• Silver, David, et al. "Mastering the game of Go with deep neural networks and tree
search." Nature 529.7587 (2016): 484-489.
• Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach.
  Prentice-Hall, Englewood Cliffs (1995).
• Browne, Cameron B., et al. "A survey of monte carlo tree search methods."
Computational Intelligence and AI in Games, IEEE Transactions on 4.1 (2012): 1-43.
• Kocsis, Levente, and Csaba Szepesvári. "Bandit based monte-carlo planning."
Machine Learning: ECML 2006. Springer Berlin Heidelberg, 2006. 282-293.
• Coulom, Rémi. "Efficient selectivity and backup operators in Monte-Carlo tree
search." Computers and games. Springer Berlin Heidelberg, 2006. 72-83.
