MAAI Lecture 6
Single-agent
Reinforcement Learning
(Value-based approaches)
Prof. Jun Wang
Computer Science, UCL
Recap
• Lecture 1: Multiagent AI and basic game theory
• Lecture 2: Potential games, and extensive form and
repeated games
• Lecture 3: Solving (“Learning”) Nash Equilibria
• Lecture 4: Bayesian Games, auction theory and
mechanism design
• Lecture 5: Learning and deep neural networks
• Lecture 6: Single-agent Learning (1)
• Lecture 7: Single-agent Learning (2)
• Lecture 8: Multi-agent Learning (1)
• Lecture 9: Multi-agent Learning (2)
• Lecture 10: Multi-agent Learning (3)
Subject to change
Learning in a single-agent setting
Mnih V, Kavukcuoglu K, Silver D, et al. Playing atari with deep reinforcement learning[J]. arXiv preprint arXiv:1312.5602, 2013.
AlphaGo vs. the world’s ‘Go’ champion
2016
http://www.goratings.org/
Coulom, Rémi. "Whole-history rating: A Bayesian rating system for players of time-varying strength." Computers and Games. Springer Berlin Heidelberg, 2008. 113-124.
Content
• Reinforcement Learning
– The model-based methods
• Markov Decision Process
• Planning by Dynamic Programming
• Convergence Analysis
– The model-free methods
• Model-free Prediction
– Monte-Carlo and Temporal Difference
• Model-free Control
– On-policy SARSA and off-policy Q-learning
– Convergence Analysis
• Deep Reinforcement Learning Examples
– Atari games
– Alpha Go
Reinforcement Learning
• Reinforcement learning (RL)
  – learning to take actions over an environment so as to maximize some numerical value which represents a long-term objective
• The single-agent reinforcement learning framework
  – An agent receives the environment's state and a reward associated with the last state transition.
  – It then calculates an action which is sent back to the system.
  – In response, the system makes a transition to a new state and the cycle is repeated.
  – The problem is to learn a way of controlling the environment so as to maximize the total reward; a reward signal indicates if the agent has reached some goal (or has been penalized, if the reward is negative).
(Figure: the agent-environment loop, with the agent sending an Action and the environment returning an Observation and a Reward.)
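The interaction cycle above can be written down directly. A minimal sketch (the `env`/`agent` interfaces and method names are my assumptions, not from the slides):

```python
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                          # initial observation from the environment
    reward, total_reward = 0.0, 0.0
    for _ in range(max_steps):
        action = agent.act(state, reward)        # agent maps (observation, reward) -> action
        state, reward, done = env.step(action)   # environment transitions and emits a reward
        total_reward += reward
        if done:                                 # terminal state reached
            break
    return total_reward                          # the long-term objective the agent maximizes
```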
Reinforcement Learning Definition
• A computational approach to learning from interaction to achieve a goal
• The environment
  – Receives action At
  – Emits observation Ot+1
  – Emits scalar reward Rt+1
• t increments at each environment step
(Figure: Agent ↔ Environment loop with Observation, Action, Reward.)
Elements of RL Systems
• History is the sequence of observations, actions, rewards
  Ht = O1, R1, A1, O2, R2, A2, ..., Ot−1, Rt−1, At−1, Ot, Rt
  – i.e. all observable variables up to time t
  – E.g., the sensorimotor stream of a robot or embodied agent
• What happens next depends on the history:
  – The agent selects actions
  – The environment selects observations/rewards
• State is the information used to determine what happens next (actions, observations, rewards)
  – Formally, state is a function of the history: St = f(Ht)
• Value function
  – State value is a scalar specifying what is good in the long run
  – The value function is a prediction of the cumulated future reward
  – Used to evaluate the goodness/badness of states (given the current policy)
  Vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
Elements of RL Systems
• A Model of the environment mimics the behavior of the environment
  – Predicts the next state
    P^a_{ss′} = P[St+1 = s′ | St = s, At = a]
  – Predicts the next (immediate) reward
    R^a_s = E[Rt+1 | St = s, At = a]
The agent may know the model of the environment or may not!
Maze Example
• Definition
  – A state St is Markov if and only if P[St+1 | St] = P[St+1 | S1, ..., St]
• Properties
  – The state captures all relevant information from the history
  – Once the state is known, the history may be thrown away
  – i.e. the state is a sufficient statistic of the future
Markov Decision Process
• A Markov decision process is a tuple (S, A, {Psa}, γ, R)
• S is the set of states
  – E.g., location in a maze, or the current screen in an Atari game
• A is the set of actions
  – E.g., move N, E, S, W, or the direction of the joystick and the buttons
• Psa are the state transition probabilities
  – For each state s ∈ S and action a ∈ A, Psa is a distribution over the next state in S
• γ ∈ [0,1] is the discount factor for the future reward
• R : S × A → ℝ is the reward function
  – Sometimes the reward is only assigned to the state
Markov Decision Process
The dynamics of an MDP proceed as follows:
• Start in a state s0
• The agent chooses some action a0 ∈ A
• The agent gets the reward R(s0, a0)
• The MDP randomly transits to some successor state s1 ~ Ps0a0
• This proceeds iteratively:
  s0 --a0, R(s0,a0)--> s1 --a1, R(s1,a1)--> s2 --a2, R(s2,a2)--> s3 ...
• Until a terminal state sT is reached, or it proceeds with no end
• The total payoff of the agent is
  R(s0, a0) + γR(s1, a1) + γ²R(s2, a2) + ...
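The total payoff is just a discounted sum over the trajectory's rewards. A tiny illustration (my own, not from the slides):

```python
def total_payoff(rewards, gamma=0.9):
    """rewards: [R(s0, a0), R(s1, a1), ...] collected along one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# 1 + 0.5 + 0.25 = 1.75 for three unit rewards with gamma = 0.5
assert abs(total_payoff([1.0, 1.0, 1.0], gamma=0.5) - 1.75) < 1e-12
```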
MDP Goal and Policy
• The goal is to choose actions over time to maximize the expected cumulated reward
  E[R(s0) + γR(s1) + γ²R(s2) + ...]
• γ ∈ [0,1] is the discount factor for the future reward, which makes the agent prefer immediate reward to future reward
  – In finance, today's $1 is more valuable than $1 tomorrow
• Given a particular policy π(s) : S → A
  – i.e. take the action a = π(s) at state s
• Define the value function for π
  Vπ(s) = E[R(s0) + γR(s1) + γ²R(s2) + ... | s0 = s, π]
  – i.e. the expected cumulated reward given the start state and taking actions according to π
Bellman Equation for Value Function
• Define the value function for π
  Vπ(s) = E[R(s0) + γR(s1) + γ²R(s2) + ... | s0 = s, π]
• Pulling out the first reward gives the Bellman equation for Vπ:
  Vπ(s) = R(s) + γ Σ_{s′∈S} Psπ(s)(s′) Vπ(s′)
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control (Vol. 1). Belmont, MA: Athena Scientific.
Optimal Value Function
• The optimal value function for each state s is the best possible sum of discounted rewards that can be attained by any policy:
  V*(s) = max_π Vπ(s)
• The Bellman equation for the optimal value function:
  V*(s) = R(s) + max_{a∈A} γ Σ_{s′∈S} Psa(s′) V*(s′)
• The optimal policy:
  π*(s) = arg max_{a∈A} Σ_{s′∈S} Psa(s′) V*(s′)
• For every state s and every policy π:
  V*(s) = Vπ*(s) ≥ Vπ(s)
Value Iteration & Policy Iteration
• Note that the value function and policy are correlated:
  Vπ(s) = R(s) + γ Σ_{s′∈S} Psπ(s)(s′) Vπ(s′)
  π(s) = arg max_{a∈A} Σ_{s′∈S} Psa(s′) Vπ(s′)
• It is feasible to perform iterative updates towards the optimal value function and optimal policy
  – Value iteration
  – Policy iteration
Value Iteration
• For an MDP with finite state and action spaces, |S| < ∞, |A| < ∞
• Value iteration is performed as
  1. For each state s, initialize V(s) = 0.
  2. Repeat until convergence {
       For each state, update
       V(s) = R(s) + max_{a∈A} γ Σ_{s′∈S} Psa(s′) V(s′)
     }
• Note that there is no explicit policy in the above calculation
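A minimal sketch of the loop above for a small tabular MDP (the data layout — P[s][a] as a list of (probability, next state) pairs and R[s] as the state reward — is my assumption):

```python
import numpy as np

def value_iteration(S, A, P, R, gamma=0.9, tol=1e-6):
    """S: list of state indices, A: list of actions,
    P[s][a]: list of (prob, next_state) pairs, R[s]: reward of state s."""
    V = np.zeros(len(S))                                   # 1. initialize V(s) = 0
    while True:                                            # 2. repeat until convergence
        V_new = np.array([
            R[s] + gamma * max(sum(p * V[s2] for p, s2 in P[s][a]) for a in A)
            for s in S
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```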
Value Iteration Example: Shortest Path
Policy Iteration
• For an MDP with finite state and action spaces, |S| < ∞, |A| < ∞
• Policy iteration is performed as
  1. Initialize π randomly
  2. Repeat until convergence {
       a) Let V := Vπ
       b) For each state, update
          π(s) = arg max_{a∈A} Σ_{s′∈S} Psa(s′) V(s′)
     }
• The step of value function update (computing Vπ) could be time-consuming
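For comparison, a sketch of the same loop with an iterative policy-evaluation inner step (same assumed P, R, S, A conventions as the value iteration sketch above):

```python
import numpy as np

def policy_iteration(S, A, P, R, gamma=0.9, eval_tol=1e-6):
    pi = {s: A[0] for s in S}                                     # 1. initialize pi (arbitrarily)
    while True:                                                   # 2. repeat until convergence
        V = np.zeros(len(S))
        while True:                                               # a) V := V^pi via iterative evaluation
            V_new = np.array([R[s] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
                              for s in S])
            if np.max(np.abs(V_new - V)) < eval_tol:
                break
            V = V_new
        new_pi = {s: max(A, key=lambda a: sum(p * V[s2] for p, s2 in P[s][a]))
                  for s in S}                                     # b) greedy improvement
        if new_pi == pi:
            return pi, V
        pi = new_pi
```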
Policy Iteration
• Policy evaluation
  – Estimate Vπ
  – Iterative policy evaluation
• Policy improvement
  – Generate π′ ≥ π
  – Greedy policy improvement
Evaluating a Random Policy in the Small Gridworld
(Figure: gridworld with r = −1 on all transitions; Vk for the random policy and the greedy policy w.r.t. Vk are shown at k = 1, 2, 3, 10, ∞; the greedy policy w.r.t. Vk becomes the optimal policy.)
Value Iteration vs. Policy Iteration

Value iteration:
1. For each state s, initialize V(s) = 0.
2. Repeat until convergence {
     For each state, update
     V(s) = R(s) + max_{a∈A} γ Σ_{s′∈S} Psa(s′) V(s′)
   }

Policy iteration:
1. Initialize π randomly
2. Repeat until convergence {
     a) Let V := Vπ
     b) For each state, update
        π(s) = arg max_{a∈A} Σ_{s′∈S} Psa(s′) V(s′)
   }

Remarks:
1. Value iteration is a greedy update strategy
2. In policy iteration, the value function update by the Bellman equation is costly
3. For small-space MDPs, policy iteration is often very fast and converges quickly
4. For large-space MDPs, value iteration is more practical (efficient)
5. If there is no state-transition loop, it is better to use value iteration
My point of view: value iteration is like SGD and policy iteration is like BGD
Convergence Analysis
• How do we know that value iteration
converges to v∗?
• Or that iterative policy evaluation converges
to vπ? And therefore that policy iteration
converges to v∗?
• Is the solution unique?
• How fast do these algorithms converge?
• These questions are resolved by the contraction mapping theorem.
Value Function Space
• Consider the vector space V over value functions
• There are |S| dimensions
– Each point in this space fully specifies a value function v(s)
– What does a Bellman backup do to the points in this space?
  • We will show that it brings value functions closer
  • And therefore the backups must converge on a unique solution
(Figure: a point V = {Vs1, Vs2, Vs3} in the 3-dimensional value-function space with axes s1, s2, s3.)
Value Function ∞-Norm
• We will measure the distance between state-value functions u and v by the ∞-norm (infinity norm or max norm), i.e. the largest difference between state values:
  ‖u − v‖∞ = max_{s∈S} |u(s) − v(s)|
s1
Banach's Contraction Mapping Theorem
• For any metric space V that is complete (i.e.
closed) under an operator T(v) , where T is a γ-
contraction,
– T converges to a unique fixed point: v = T(v)
– At a linear convergence rate of γ
Bellman Operator on Q-function
• Let us define the Bellman expectation operator on the Q-function:
  Qπ(s, a) = E[Σ_{t=0}^∞ γᵗ Rt | S0 = s, A0 = a]
           = E[R(S0, A0) + γ Qπ(S1, π(S1)) | S0 = s, A0 = a]
           = r(s, a) + γ Σ_{s′∈S} P(s′|s, a) Qπ(s′, π(s′))
           =: (Tπ Qπ)(s, a)
• Bellman optimality operator on the Q-function:
  (T* Q)(s, a) := r(s, a) + γ Σ_{s′∈S} P(s′|s, a) max_{a′∈A} Q(s′, a′)
Bellman Operator on Q-function
• Value-based approaches try to find a fixed point Q̂ such that Q̂ ≈ T* Q̂ ≈ Q*
• The greedy policy π(s; Q̂) is then close to the optimal policy π*:
  Q^{π(s;Q̂)} ≈ Q^{π*} = Q*
  where π(s; Q̂) = arg max_{a∈A} Q̂(s, a)

Contraction Property of the Bellman Operator on Q-function
• The Bellman optimality operator is a γ-contraction:
  ‖T* Q1 − T* Q2‖∞ ≤ γ ‖Q1 − Q2‖∞
• Value iteration on the Q-function repeats Qk+1 ← T* Qk
• Let's say Q2 is Q*
Proof of the Contraction Property
• Therefore, we get
  sup_{(s,a)∈S×A} |(T* Q1)(s, a) − (T* Q2)(s, a)| ≤ γ sup_{(s,a)∈S×A} |Q1(s, a) − Q2(s, a)|
• Or, succinctly,
  ‖T* Q1 − T* Q2‖∞ ≤ γ ‖Q1 − Q2‖∞
Proof of the Contraction Property (2)
• Banach's fixed-point theorem can be used to show the existence and uniqueness of the optimal value function.
• Also, since we have Q* = T* Q*,
  ‖T* Q0 − Q*‖∞ = ‖T* Q0 − T* Q*‖∞ ≤ γ ‖Q0 − Q*‖∞
• Thus, for value iteration, where Qk+1 ← T* Qk, we have
  ‖Qk − Q*‖∞ = ‖T* Qk−1 − T* Q*‖∞ ≤ γ ‖Qk−1 − Q*‖∞ ≤ ... ≤ γᵏ ‖Q0 − Q*‖∞
• Finally, as k → ∞, the RHS converges to zero, showing that ‖Qk − Q*‖∞ → 0. This shows that Qk → Q*.
Bellman Optimality Backup is a Contraction

• Learn an MDP model from "experience"
  – For example, we have only observed some episodes:
    Episode 1: s0^(1) --a0^(1)--> s1^(1) --a1^(1)--> s2^(1) --a2^(1)--> s3^(1) ... sT^(1), with rewards R(s0)^(1), R(s1)^(1), R(s2)^(1), ...
    Episode 2: s0^(2) --a0^(2)--> s1^(2) --a1^(2)--> s2^(2) --a2^(2)--> s3^(2) ... sT^(2), with rewards R(s0)^(2), R(s1)^(2), R(s2)^(2), ...
    ...
  – Learning state transition probabilities Psa(s′):
    Psa(s′) = (#times we took action a in state s and got to state s′) / (#times we took action a in state s)
• Algorithm
1. Initialize π randomly.
2. Repeat until convergence {
a) Execute π in the MDP for some number of trials
b) Using the accumulated experience in the MDP, update our estimates for Psa
and R
c) Apply value iteration with the estimated Psa and R to get the new estimated
value function V
d) Update π to be the greedy policy w.r.t. V
}
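The counting estimate of Psa(s′) can be implemented in a few lines. A small sketch (the episode format — lists of (s, a, s′) transitions — is my assumption):

```python
from collections import defaultdict

def estimate_transitions(episodes):
    """episodes: iterable of episodes, each a list of (s, a, s_next) transitions."""
    count_sas = defaultdict(int)   # #times (s, a) led to s'
    count_sa = defaultdict(int)    # #times action a was taken in state s
    for ep in episodes:
        for s, a, s_next in ep:
            count_sas[(s, a, s_next)] += 1
            count_sa[(s, a)] += 1
    # P_sa(s') = count(s, a, s') / count(s, a)
    return {(s, a, s2): c / count_sa[(s, a)] for (s, a, s2), c in count_sas.items()}
```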
Model-free Reinforcement Learning
• So far we have been focused on
  – Calculating the optimal value function
  – Learning the optimal policy
  given a known MDP model
  – i.e. the state transition Psa(s′) and reward function R(s) are explicitly given
• In realistic problems, often the state transition and reward function are not explicitly given
• For example, we have only observed some episodes
  Episode 1: s0^(1) --a0^(1)--> s1^(1) --a1^(1)--> s2^(1) --a2^(1)--> s3^(1) ... sT^(1), with rewards R(s0)^(1), R(s1)^(1), R(s2)^(1), ...
Circle Surface = Square Surface × (#points in circle / #points in total)
Monte-Carlo Methods
• Go Game: to estimate the winning rate given the current state
(Figure: Monte-Carlo returns R1, R2, R3, ... collected from simulated episodes that visit state s.)
• Every time-step t that state s is visited in an episode:
  – Increment counter N(s) ← N(s) + 1
  – Increment total return S(s) ← S(s) + Gt
• Value is estimated by the mean return V(s) = S(s)/N(s)
• By the law of large numbers, V(s) → Vπ(s) as N(s) → ∞
Incremental Monte-Carlo Updates
• Update V(s) incrementally after each episode
• For each state St with cumulated return Gt:
  N(St) ← N(St) + 1
  V(St) ← V(St) + (1/N(St)) (Gt − V(St))
• For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes:
  V(St) ← V(St) + α (Gt − V(St))
• MC methods learn directly from episodes of experience
• MC is model-free: no knowledge of MDP transitions / rewards
• MC learns from complete episodes: no bootstrapping (discussed later)
• MC uses the simplest possible idea: value = mean return
• Caveat: can only apply MC to episodic MDPs
  – All episodes must terminate
Temporal-Difference Learning
• TD methods learn directly from episodes of experience
• TD is model-free: no knowledge of MDP transitions / rewards
• TD learns from incomplete episodes, by bootstrapping
• TD updates a guess towards a guess
Monte Carlo vs. Temporal Difference
• The same goal: learn Vπ from episodes of experience under policy π
• Incremental every-visit Monte-Carlo
  – Update value V(St) toward the actual return Gt:
    V(St) ← V(St) + α (Gt − V(St))
• Simplest temporal-difference learning algorithm: TD
  – Update value V(St) toward the estimated return Rt+1 + γV(St+1):
    V(St) ← V(St) + α (Rt+1 + γV(St+1) − V(St))
  – TD target: Rt+1 + γV(St+1)
  – TD error: δt = Rt+1 + γV(St+1) − V(St)
In MC, Gt is the reward obtained once home is reached, so the change can only be made offline, after having reached home. By contrast, in TD, each estimate would be updated toward the estimate that immediately follows it.
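A tiny illustration (mine, not from the slides) contrasting the two update targets on a single state: MC waits for the full return Gt, while TD(0) bootstraps from the current estimate of the next state:

```python
def mc_update(V, s_t, G_t, alpha=0.1):
    V[s_t] += alpha * (G_t - V[s_t])                     # target: actual return G_t

def td0_update(V, s_t, r_t1, s_t1, alpha=0.1, gamma=0.9):
    V[s_t] += alpha * (r_t1 + gamma * V[s_t1] - V[s_t])  # target: bootstrapped guess
```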
Advantages and Disadvantages of MC vs. TD
• On-policy learning
– “Learn on the job”
– Learn about policy π from experience sampled from π
• Off-policy learning
– “Look over someone’s shoulder”
– Learn about policy π from experience sampled from
another policy μ
State Value and Action Value
State Value and Action Value Value
State Value and Action
State Value and Action Value
T ¡1
G =R
Gtt = Rt+1
t+1 + °R
+ °Rt+2 t+2
+ :::°
+ : : : ° T ¡1 RT
RT
Gt = Rt+1 + °Rt+2 + : : : ° T ¡1 RT
• •State
Statevalue
• State value
value
π
• The
• –The
• State The state-valuefunction
state-value
value
state-value functionVVπ(s)
function V(s) ofanof
πof
(s) anMDP
MDP
an is is
MDP the
the expected
expected
is the expected
return
return starting from state
starting function
from s and then following policy π
• return
The starting
state-value fromstate
Vπs(s)
stateand then
sofand following
then
an MDP policy πpolicy
is following
the expected π
return starting from state
V ¼ (s) = sEand then following policy π
¼ [Gt jSt = s]
¼
V
¼
(s) = E¼ [Gt jSt = s]
V (s) = E¼ [Gt jSt = s]
• Action value
Action
• ••Action
Action value
• Thevalue
action-value
value function Qπ(s,a) of an MDP is the
• The
• • The action-value
expected return
The action-value function
starting from
function
action-value function π Q ππ(s,a)
state
Q (s,a)
Q (s,a) s, of
of an an
taking
of
MDP anisMDP
the isa,
action
MDP isthe
the
expectedreturn
and then
expected
expected return
following
return starting
policy
starting
starting
π from
statestate
fromfrom s,s,taking
s, taking
state actionaction
taking actiona,a,
a,
andthen
and
and thenfollowing
then following policy
Q¼ (s; a)policy
following E¼π πt = s; At = a]
[Gt jS
=policy π
Q¼ (s;
¼ a) = E¼ [Gt jSt = s; At = a]
Q (s; a) = E¼ [Gt jSt = s; At = a]
Bellman Expectation Equation
BellmanExpectation
Bellman ExpectationEquation
Equation
• The state-value function Vπ(s) can be decomposed into
• • The
Thestate-value
state-valuefunction
function
immediate reward plus discountedVV π
π(s)
(s)can
canbe
bedecomposed
decomposed
value into
into state
of successor
immediatereward
immediate rewardplus
plusdiscounted
discountedvalue
valueofofsuccessor
successorstate
state
V ¼ (s) (s)E=¼ [R
V ¼= [Rt+1++°V
E¼t+1
¼
°V ¼(St+1 )jS
t+1)jS t =
t = s] s]
• •The
Theaction-value function
action-value QππQ(s,a)
function can
π(s,a) similarly
can bebe
similarly
• The action-value
decomposed
decomposed
function Q (s,a) can similarly be
decomposed
Q¼ (s; a) = E¼ [Rt+1 + °Q¼ (St+1 ; At+1 )jSt = s; At = a]
Q¼ (s; a) = E¼ [Rt+1 + °Q¼ (St+1 ; At+1 )jSt = s; At = a]
State
State Value
Value andand Action
Action Value
Value
V ¼ (s) Ã s
X
¼
V (s) = ¼(ajs)Q¼ (s; a)
a2A
Q¼ (s; a) Ã s; a
Q¼ (s; a) Ã s; a
X
¼
R(s; a) Q (s; a) = R(s; a) + ° Psa (s0 )V ¼ (s0 )
s0 2S
V ¼ (s0 ) Ã s0
Model-Free
Model-Free
Model-FreePolicyPolicy Iteration
Iteration
Policy Iteration
• Given Given
•• Given state-value
state-value function
function
state-value V(s)
V(s)V(s)
function and
andand action-value
action-value function
function
action-value function
Q(s,a),
Q(s,a),
Q(s,a), model-free
model-free
model-free policy
policy iteration
iteration
policy shall
iteration shall use
useuse
shall action-value
action-value
action-value
function
function
function
•• Greedy
• Greedy
Greedy policy
policy improvement
improvement
policy over
improvement over V(s)
V(s)V(s)
over requires
requires model
model
requires ofMDP
MDP
ofofMDP
model
n n XX oo
new ¼ new (s) 0 ¼ 0(s0 )
¼
¼ = max
(s) = arg arg max a) +a)°+ ° PsaP(ssa0 )V
R(s;R(s; (s )V(s )
a2A a2A 0
s0 2Ss 2S
Wedon’t
We don’t
We don’t know
know
know thetransition
thethe transition
transition probability
probability
probability
•• Greedy
Greedy
• Greedy policy
policy improvement
improvement
improvement overover
over Q(s,a)
Q(s,a)
Q(s,a) isismodel-free
model-free
is model-free
new
new¼ (s) = arg max Q(s; a)
¼ (s) = arg maxa2A
Q(s; a)
a2A
Generalized Policy Iteration with Action-Value Function
Every episode:
• Policy evaluation: Monte-Carlo policy evaluation, Q ≈ Qπ
• Policy improvement: ɛ-greedy policy improvement
MC Control vs. TD Control
• Temporal-difference (TD) learning has several advantages over
Monte-Carlo (MC)
– Lower variance
– Online
– Incomplete sequences
At state s′, take action a′
• Updating action-value functions with Sarsa (the standard update, stated here explicitly):
  Q(S, A) ← Q(S, A) + α (R + γ Q(S′, A′) − Q(S, A))
(Figure: gridworld example; with exploration, the average episode length is 17 steps, where 15 is the minimum.)
Note: as the training proceeds, the Sarsa policy achieves the goal more and
more quickly
Off-Policy Learning
• Evaluate target policy π(a|s) to compute Vπ(s) or Qπ(s,a)
• While following behavior policy μ(a|s) (used to simulate trajectories):
  {s1, a1, r2, s2, a2, ..., sT} ~ μ
• Re-weight each instance by β(x) = p(x)/q(x)
Importance Sampling for Off-Policy Monte-Carlo
• Use returns generated from μ to evaluate π
• Weight return Gt according to the importance ratio between policies
• Multiply the importance ratio along the whole episode:
  {s1, a1, r2, s2, a2, ..., sT} ~ μ
  Gt^{π/μ} = [π(at|st)/μ(at|st)] · [π(at+1|st+1)/μ(at+1|st+1)] · ... · [π(aT|sT)/μ(aT|sT)] · Gt

Importance Sampling for Off-Policy TD
• Use TD targets generated from μ to evaluate π
• Weight the TD target r + γV(s′) by importance sampling
• Only need a single importance sampling correction:
  V(st) ← V(st) + α ( [π(at|st)/μ(at|st)] (rt+1 + γV(st+1)) − V(st) )
  (importance sampling correction × TD target)
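A small sketch (my own illustration) of one off-policy TD(0) update with the single importance-sampling correction above; `pi` and `mu` are assumed to be dicts mapping (state, action) to probabilities:

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=0.9):
    rho = pi[(s, a)] / mu[(s, a)]               # importance sampling correction
    td_target = r + gamma * V[s_next]           # TD target built from behaviour-policy data
    V[s] += alpha * (rho * td_target - V[s])    # update toward the corrected target
    return V
```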
Q-Learning Control Algorithm
At state s, take action a
Observe reward r
Transit to the next state s′
• Theorem: Q-learning control converges to the optimal action-value function,
  Q(s, a) → Q*(s, a)
Watkins, Christopher J. C. H., and Peter Dayan. "Q-learning." Machine Learning 8.3-4 (1992): 279-292.
Q-Learning Control Algorithm
At state s, take action a
Observe reward r
Transit to the next state s′
At state s′, take action argmax_{a′} Q(s′, a′)
  Q(st, at) ← Q(st, at) + α (rt+1 + γ max_{a′} Q(st+1, a′) − Q(st, at))
• Why is Q-learning an off-policy control method?
  – Learning from SARS tuples generated by another policy μ
  – The first action a and the corresponding reward r are from μ
  – The next action a′ is picked by the target policy π(st+1) = arg max_{a′} Q(st+1, a′)
• Why no importance sampling?
  – It learns the action-value function, not the state-value function
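A minimal sketch of the tabular update above; the ε-greedy behaviour policy and the env.step() convention (state, reward, done) are my additions, not prescribed by the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                                   # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # behaviour policy mu: epsilon-greedy with respect to the current Q
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # off-policy target: max over a', not the action the behaviour policy takes next
            target = r + gamma * max(Q[(s_next, b)] for b in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```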
SARSA vs. Q-Learning Experiments
• Cliff-walking
  – Undiscounted reward
  – Episodic task
  – Reward = −1 on all transitions
  – Stepping into the cliff area incurs −100 reward and sends the agent back to the start
• SARSA updates for each state-action-reward-state-action tuple generated by the current policy
(Figure: cliff-walking gridworld comparing the paths learned by SARSA and Q-learning.)
Step-by-step example of Q-learning
• Agents are expected to find a route to the outside of the rooms. Once going to the outside, i.e. No. 5, the reward is 100.
• Each room, including the outside, is a state, and the agent's move from one room to another will be an action.
http://mnemstudio.org/path-finding-q-learning-tutorial.htm
• The agent can learn through experience.
• The agent can pass from one room to another but has no knowledge of the environment (model), and doesn't know which sequence of doors leads to the outside.
• We use −1 in the table to represent no connection between nodes.
• The Q table is initialized as the 6 × 6 all-zero matrix Q0 (one row and one column per state 0-5).
• There are two possible actions for the current state 1: go to state 3, or go to state 5.
• By random selection, we select to go to 5 as our action.
  – Now let's imagine what would happen if our agent were in state 5.
• Look at the sixth row of the reward matrix R (i.e. state 5). It has 3 possible actions: go to state 1, 4 or 5.
  Q(state, action) = R(state, action) + Gamma × Max[Q(next state, all actions)]
• Since state 5 is the goal state, we finish this episode. Our agent's updated matrix Q is now:
  Q2 =
  [  0   0   0   0   0   0
     0   0   0   0   0 100
     0   0   0   0   0   0
     0  80   0   0   0   0
     0   0   0   0   0   0
     0   0   0   0   0   0 ]
• If our agent learns more through further episodes, it will finally reach convergence values in matrix Q.
• Once the matrix Q gets close enough to a state of convergence, we know our agent has learned the most optimal paths to the goal state.
• Tracing the best sequences of states is as simple as following the links with the highest values at each state.
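A sketch of this room example end to end. The reward-matrix connectivity is taken from the linked tutorial and should be treated as an assumption; with γ = 0.8 it reproduces the Q(1,5) = 100 and Q(3,1) = 80 entries seen in Q2 above:

```python
import numpy as np

# Rows = current state (rooms 0-4, outside = 5), columns = next state; -1 = no connection.
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])
Q, gamma = np.zeros_like(R, dtype=float), 0.8

for _ in range(1000):                                    # many random-exploration episodes
    s = np.random.randint(6)
    while s != 5:                                        # episode ends at the goal state 5
        a = np.random.choice(np.where(R[s] >= 0)[0])     # pick any connected room at random
        Q[s, a] = R[s, a] + gamma * Q[a].max()           # Q-learning update with alpha = 1
        s = a
print(np.round(Q))                                       # converged Q table
```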
Relationship Between DP and TD
Full Backup (DP) vs. Sample Backup (TD):
• Bellman Expectation Equation for Vπ(s): Iterative Policy Evaluation (DP) ↔ TD Learning (sample backup)
• Bellman Expectation Equation for Qπ(s,a): Q-Policy Iteration (DP) ↔ SARSA (sample backup)
• Bellman Optimality Equation for Q*(s,a): Q-Value Iteration (DP) ↔ Q-Learning (sample backup)
Convergence Proof of Q-learning
• Recall that the optimal Q-function (S × A → ℝ) is a fixed point of a contraction operator T*, defined for a generic function q : S × A → ℝ as
  (T* Q)(s, a) := r(s, a) + γ Σ_{s′∈S} P(s′|s, a) max_{a′∈A} Q(s′, a′)
• Let {st} be a sequence of states obtained by following policy π, {at} the sequence of corresponding actions and {rt} the sequence of obtained rewards. Then, given any initial estimate Q0 and a finite MDP, the Q-learning algorithm given by the above update rule converges w.p.1 to the optimal Q-function as long as
  Σt αt(s, a) = ∞,   Σt αt²(s, a) < ∞
  for all (s, a) ∈ S × A.
  – For example, αt(s, a) = 1/Nt(s, a), where Nt(s, a) counts the visits to (s, a), satisfies both conditions.
• The step-sizes αt(s, a) mean that, at the (t + 1)-th update, only the component (st, at) is updated. The theorem suggests that all state-action pairs be visited infinitely often.
The stochastic approximation theorem: the random process {Δt} taking values in ℝⁿ and defined as
  Δt+1(x) = (1 − αt(x)) Δt(x) + αt(x) Ft(x)
converges to zero w.p.1 under the following assumptions:
• 0 ≤ αt ≤ 1, Σt αt(x) = ∞ and Σt αt²(x) < ∞
• ‖E[Ft(x) | ℱt]‖W ≤ γ ‖Δt‖W, with γ < 1
• var[Ft(x) | ℱt] ≤ C (1 + ‖Δt‖W²), for some C > 0
(ℱt denotes all the past information up to time t.)
Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201, 1994.
Convergence Proof of Q-learning (4)
• We want to use the stochastic approximation theorem to prove the convergence of Q-learning.
• We start by rewriting the Q-learning update rule as
  Qt+1(st, at) = (1 − αt(st, at)) Qt(st, at) + αt(st, at) [rt + γ max_{b∈A} Qt(st+1, b)]
• Subtracting the quantity Q*(st, at) from both sides and letting
  Δt(s, a) = Qt(s, a) − Q*(s, a)
  yields
  Δt+1(st, at) = (1 − αt(st, at)) Δt(st, at) + αt(st, at) [rt + γ max_{b∈A} Qt(st+1, b) − Q*(st, at)]
Convergence Proof of Q-learning (5)
• Writing the update in the form of the theorem, with
  Ft(s, a) = r(s, a, S(s, a)) + γ max_{b∈A} Qt(S(s, a), b) − Q*(s, a),
  where S(s, a) is a random sample state obtained from the transition model (the Markov chain (S, Pa)), we have
  E[Ft(s, a) | ℱt] = Σ_{y∈S} Pa(s, y) [r(s, a, y) + γ max_{b∈A} Qt(y, b)] − Q*(s, a)
                   = (T* Qt)(s, a) − Q*(s, a)
• Using the fact that Q* = T* Q*,
  E[Ft(s, a) | ℱt] = (T* Qt)(s, a) − (T* Q*)(s, a).
Convergence Proof of Q-learning (6)
• From the contraction property shown earlier (Slide 86), we know that this operator is a contraction in the sup-norm, i.e.
  ‖T* Q1 − T* Q2‖∞ ≤ γ ‖Q1 − Q2‖∞
• So,
  ‖E[Ft(s, a) | ℱt]‖∞ ≤ γ ‖Qt − Q*‖∞ = γ ‖Δt‖∞
  and Condition 2 of the theorem is met.
• For Condition 3, the variance bound var[Ft(s, a) | ℱt] ≤ C (1 + ‖Δt‖W²) follows because the rewards are bounded.
Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
Reinforcement learning
• Decisions are made in stages, each producing a {state s, action a, reward r} tuple
• The outcome of each decision is not fully predictable but can be observed before the next decision is made
• The objective is
– to maximize a numerical measure of total reward
Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
Q-learning
• For a game, the future discounted return at time t is
    R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ..., where γ is the discount factor
    (interpretable via a per-step termination probability of 1 − γ)
• Bellman equation
    Q*(s, a) = E_{s'~ε} [ r + γ max_{a'} Q*(s', a') | s_t = s, a_t = a ],
    where ε is the distribution over next states given by an emulator
• So iteratively update Q* as above, and act greedily: a_t* = argmax_a Q*(s_t, a)
Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
A simplified Q-learning algorithm
Maintain a Q table
and keep updating it
http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
Deep Q-learning
• However, in many cases a Q table is not practical, as
– the action × state space is very large
– it lacks generalisation
• The Q function is therefore commonly approximated by a parameterized function, either linear or non-linear:
    Q(s, a; θ) ≈ Q*(s, a), where θ is the model parameter
• It can be trained by minimising a sequence of loss functions
    L_i(θ_i) = E_{(s,a,r,s')~U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_i⁻) − Q(s, a; θ_i) )² ],
    where U(D) samples uniformly at random from a pool D of stored transitions (experience replay) and the target parameters θ_i⁻ are only updated every C steps
• From the accompanying excerpt of the Nature paper: the instabilities of combining Q-learning with deep networks are addressed with two key ideas: experience replay, which randomizes over the data to remove correlations in the observation sequence and smooth over changes in the data distribution, and an iterative update that adjusts the action-values towards target values that are only periodically updated, thereby reducing correlations with the target.
Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
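A minimal PyTorch-style sketch of this loss (hedged: q_net and target_net are assumed user-defined networks mapping a batch of states to one Q-value per action; this illustrates the formula, not the paper's code):

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        # batch: (states, actions, rewards, next_states, dones) as tensors;
        # actions is a LongTensor of action indices, dones holds 0/1 floats
        states, actions, rewards, next_states, dones = batch
        q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
        with torch.no_grad():
            max_next = target_net(next_states).max(dim=1).values          # max_a' Q(s', a'; theta_i^-)
            target = rewards + gamma * (1.0 - dones) * max_next           # r + gamma * max_a' Q(...)
        return F.mse_loss(q_sa, target)

In line with the slide, target_net's parameters θ⁻ would be copied from q_net only every C gradient steps.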
Deep Q-learning
Algorithm 1 Deep Q-learning with Experience Replay
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialise sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    for t = 1, T do
        With probability ε select a random action a_t
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j                                     for terminal φ_{j+1}
            y_j = r_j + γ max_{a'} Q(φ_{j+1}, a'; θ)      for non-terminal φ_{j+1}
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² according to equation 3
    end for
end for
Note (from the paper): learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
Mnih V, Kavukcuoglu K, Silver D, et al. Playing atari with deep reinforcement learning[J]. arXiv preprint arXiv:1312.5602, 2013.
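A small sketch of the replay memory and the ε-greedy behaviour used in Algorithm 1 (Python; the class and function names are illustrative, not from the paper):

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity):
            self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped at capacity N
        def push(self, transition):
            self.buffer.append(transition)         # transition = (phi_t, a_t, r_t, phi_t1, done)
        def sample(self, batch_size):
            return random.sample(list(self.buffer), batch_size)

    def epsilon_greedy(q_values, epsilon):
        # behaviour policy used while filling the replay memory; q_values: one Q(phi, a) per action
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])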
DQN Results in Atari
• The performance of DQN is normalized with respect to a professional human games tester
Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
Content
• Reinforcement Learning
– The model-based methods
• Markov Decision Process
• Planning by Dynamic Programming
– The model-free methods
• Model-free Prediction
– Monte-Carlo and Temporal Difference
• Model-free Control
– On-policy SARSA and off-policy Q-learning
• Deep Reinforcement Learning Examples
– Atari games
– Alpha Go
The Go Game
• Go (Weiqi) is an ancient game that originated in
China, with a history of over 3,000 years
The board: 19 x 19 lines,
making 361 intersections
(`points').
The move: starting from
Black, stones are placed
on points in alternation
until the end
The goal: control the
max. territory
• The rules: each player places one stone on any unoccupied intersection on his turn; once placed, stones are not moved, and they are not removed except as the rules provide (captures); stones are never stacked. The beauty of Go lies in the simplicity of its rules.
• Liberties: the unoccupied intersections (or points) directly adjacent to a stone, marked as X in Diagram 1-5
• A chain consists of two or more stones connected horizontally or vertically (but not diagonally); the liberties of a chain are counted together as a unit, e.g. the two black stones in Diagram 1-5 share a combined total of six liberties
• A move that leaves your own stones (but not your opponent's) with no liberties is known as suicide; usually suicide is forbidden, but some variations of the rules allow it
(Diagrams 1-3 to 1-8 illustrate liberties, chains and captures.)
One of the great challenges of AI
• It is far more complex than chess
– Chess has roughly 35^80 possible sequences of moves
– Go has roughly 250^150 possible sequences of moves
• Computers match or surpass top humans in chess, backgammon, poker, even Jeopardy
https://jedionston.files.wordpress.com/2015/02/go-game-storm-trooper.jpg
• As a decision problem:
– State space S
– Action space A
– Known transition model: p(s, a, s')
– Reward on final states: win or lose
• Baseline strategies do not apply: the full game tree cannot be grown
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
Neumann, L. J., & Morgenstern, O. (1947). Theory of games and economic behavior (Vol. 60). Princeton: Princeton University Press.
Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1995.
Minimax search
• Tic-tac-toe is a zero-sum two-player game
• The minimax algorithm is defined recursively, backing up the value of the best move for the player to move at each node (a sketch follows below)
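A minimal recursive minimax sketch (hedged: the game object with is_terminal()/score()/legal_moves()/play()/undo() is a hypothetical interface for a game such as tic-tac-toe):

    def minimax(game, maximizing):
        # returns the minimax value of the current position
        if game.is_terminal():
            return game.score()          # e.g. +1 win, 0 draw, -1 loss for the maximizing player
        values = []
        for move in game.legal_moves():
            game.play(move)
            values.append(minimax(game, not maximizing))
            game.undo(move)
        return max(values) if maximizing else min(values)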
• Employ two deep convolutional neural networks:
– a policy network p_σ(a|s) (SL) / p_ρ(a|s) (RL): it takes a representation of the board position s as input, passes it through many convolutional layers, and outputs a probability distribution over legal moves a
– a value network v_θ(s'): it predicts the expected terminal reward (whether the current player wins) in positions from the self-play data set
(Figure: schematic representation of the policy-network and value-network architectures used in AlphaGo.)
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
AlphaGo
• Training the policy networks to predict human moves:
– (1) p_π(a|s): a fast rollout policy with a linear softmax, trained on human expert positions
– (2) p_σ(a|s): a more accurate SL policy network (12-layer CNN), supervised-trained on 30 million positions from the KGS Go Server
– (3) p_ρ(a|s): an RL policy network, initialised from the SL policy network and then improved by policy gradient using the reward of self-play
• Policy-gradient update:
    Δρ ∝ ∂log p_ρ(a_t | s_t) / ∂ρ · z_t, where z_t is the terminal reward
(Figure 1 of the Nature paper: the neural network training pipeline, from human expert positions to self-play positions.)
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
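A hedged PyTorch-style sketch of this policy-gradient step (policy_net returning log-probabilities over moves, and a single scalar self-play outcome z, are illustrative assumptions; AlphaGo itself trained on minibatches of games):

    def policy_gradient_step(policy_net, optimizer, states, actions, z):
        # states, actions: PyTorch tensors for one self-play game;
        # z: terminal reward (+1 win, -1 loss) from the player's perspective
        log_probs = policy_net(states)                         # assumed shape [T, num_moves], log-probabilities
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = -(chosen * z).mean()                            # ascent on z * dlog p_rho(a_t|s_t)/drho
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()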
AlphaGo
• Training the value network to predict the terminal reward:
– positions are sampled randomly from games of self-play with p_ρ to train the value network v_θ
– Gradient update:
    Δθ ∝ ∂v_θ(s) / ∂θ · (z − v_θ(s)), where z is the game outcome
(Figure 1 of the Nature paper: training pipeline over human expert positions and self-play positions.)
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
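And a matching sketch of the value-network regression step (same hedged assumptions; minimising the mean-squared error yields the (z − v(s)) · ∂v/∂θ gradient shown on the slide):

    import torch.nn.functional as F

    def value_step(value_net, optimizer, states, z):
        # regress v_theta(s) towards the self-play outcome z with mean-squared error
        v = value_net(states).squeeze(1)       # assumed output shape [batch, 1] -> [batch]
        loss = F.mse_loss(v, z)                # gradient proportional to (v - z) * dv/dtheta
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()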
AlphaGo: Monte Carlo tree search
• Combining the two networks (p(a|s) and v(s)) in a lookahead search: each edge (s, a) of the search tree stores an action value Q(s, a), a visit count N(s, a) and a prior probability P(s, a)
• Selection: the tree is traversed by simulation, descending the tree in complete games without backup; at each time step t of each simulation an action is selected as
    a_t = argmax_a ( Q(s_t, a) + u(s_t, a) ),  with  u(s, a) ∝ P(s, a) / (1 + N(s, a)),
  so as to maximise the action value plus a bonus that is proportional to the prior probability but decays with repeated visits, to encourage exploration
• Expansion: when the traversal reaches a leaf node s_L at step L, the leaf node may be expanded; it is processed just once by the SL policy network p_σ, and the output probabilities are stored as prior probabilities P(s, a) = p_σ(a|s) for each legal action a
• Evaluation: the leaf node is evaluated in two very different ways: by the value network v_θ(s_L), and by the outcome z_L of a random rollout played out until the terminal step T using the fast rollout policy p_π; the two are combined with a mixing parameter λ into a leaf evaluation
    V(s_L) = (1 − λ) v_θ(s_L) + λ z_L
• Backup (Q and N): at the end of each simulation, the action values and visit counts of all traversed edges are updated; each edge accumulates the visit count and the mean evaluation of all simulations passing through it:
    N(s, a) = Σ_{i=1..n} 1(s, a, i),   Q(s, a) = (1 / N(s, a)) Σ_{i=1..n} 1(s, a, i) V(s_L^i),
  where s_L^i is the leaf node from the i-th simulation and 1(s, a, i) indicates whether edge (s, a) was traversed during the i-th simulation
• Once the search is complete, the algorithm chooses the most visited move from the root position (in later systems such as AlphaZero and MuZero the move is instead chosen stochastically with a temperature parameter)
• Reinforcement learning of value networks: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s) is trained by regression on state-outcome pairs (s, z), using stochastic gradient descent to minimise the MSE between the predicted value v_θ(s) and the corresponding outcome z, i.e. Δθ ∝ ∂v_θ(s)/∂θ · (z − v_θ(s))
– naively predicting game outcomes from data consisting of complete games leads to overfitting: successive positions are strongly correlated, differing by just one stone, while the regression target is shared for the entire game; trained on the KGS data set this way, the value network memorized the game outcomes rather than generalizing to new positions (MSE 0.37 on the test set vs. 0.19 on the training set)
– to mitigate this, a new self-play data set of 30 million distinct positions was generated, each sampled from a separate game played between the RL policy network and itself; training on this data set gave MSEs of 0.226 and 0.234 on the training and test sets respectively, indicating minimal overfitting
– a single evaluation of v_θ(s) approached the accuracy of Monte Carlo rollouts using the RL policy network p_ρ, but with 15,000 times less computation (by comparison, approaches based only on supervised learning of convolutional networks won 11% of games against Pachi and 12% against the slightly weaker Fuego)
• It is worth noting that the SL policy network p_σ performed better in AlphaGo than the stronger RL policy network p_ρ, presumably because humans select a diverse beam of promising moves whereas RL optimizes for the single best move; however, the value function derived from the stronger RL policy network, v_θ(s) ≈ v^{p_ρ}(s), performed better
(Figure 3 of the Nature paper: MCTS in AlphaGo, showing the selection, expansion, evaluation and backup phases.)
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
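A simplified Python sketch of the selection and backup statistics above (hedged: it mirrors the slide's formulas, with a c_puct constant and Node class introduced purely for illustration; AlphaGo's actual exploration bonus also scales with the parent visit count):

    class Node:
        def __init__(self, prior):
            self.P = prior          # prior probability P(s, a) from the policy network
            self.N = 0              # visit count N(s, a)
            self.W = 0.0            # sum of leaf evaluations V(s_L) backed up through this edge
        @property
        def Q(self):                # mean action value Q(s, a)
            return self.W / self.N if self.N > 0 else 0.0

    def select_action(children, c_puct=5.0):
        # argmax_a Q(s, a) + u(s, a), with u(s, a) proportional to P(s, a) / (1 + N(s, a))
        return max(children.items(),
                   key=lambda item: item[1].Q + c_puct * item[1].P / (1 + item[1].N))[0]

    def backup(edges_on_path, leaf_value):
        # after one simulation, every traversed edge accumulates the visit and the evaluation
        for edge in edges_on_path:
            edge.N += 1
            edge.W += leaf_value

    def leaf_evaluation(v_theta, z_rollout, lam):
        # V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L
        return (1.0 - lam) * v_theta + lam * z_rollout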
AlphaZero: without human knowledge
• Self-play: each game is played to the end (or until the search value drops below a resignation threshold) and is then scored to give a final reward r_T ∈ {−1, +1}; the data for each time step t is stored as (s_t, π_t, z_t), where π_t are the MCTS search probabilities and z_t = ±r_T is the game winner from the perspective of the current player at step t
• In parallel, new network parameters θ_i are trained from data (s, π, z) sampled uniformly among all time steps of the last iteration(s) of self-play; the neural network (p, v) = f_{θ_i}(s) is adjusted to minimise the error between the predicted value v and the self-play winner z, and to maximise the similarity of the network move probabilities p to the search probabilities π. Specifically, θ is adjusted by gradient descent on a loss function l that sums over mean-squared and cross-entropy losses (l = (z − v)² − π⊤ log p + c‖θ‖²)
• MuZero: (A) MCTS with a learned transition model, (B) self-play, and (C) training. A hidden state s^k is built up to represent the state, and on top of it a transition model is learned (to run MCTS in (A)) and used to predict the value and optimise the policy
(Figure 1 of the MuZero paper, "How MuZero uses its model to plan": the model consists of three connected components for representation, dynamics and prediction; given a previous hidden state s^{k−1} and a candidate action a^k, the dynamics function g produces an immediate reward r^k and a new hidden state s^k, and the policy p^k and value function v^k are computed from the hidden state s^k by a prediction function.)