RL 1
Uπ(s) = R(s) + γ Σ_{s'} T(s, π(s), s') Uπ(s')

Note the dependence on neighboring states.
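A one-state sketch of that dependence in Python: it just evaluates the right-hand side of the equation for a single state s. The successor labels, transition probabilities, and utility values are invented placeholders; γ = 1 and the blank-state reward R(s) = -0.04 follow these slides.

# Right-hand side of the Bellman equation for a single state s.
# The successors "A", "B", "C", their probabilities, and their utilities
# are made-up numbers for illustration only.
GAMMA = 1.0
R_S = -0.04                                   # R(s) for a blank state

T_s = {"A": 0.8, "B": 0.1, "C": 0.1}          # T(s, pi(s), s') for each successor s'
U = {"A": 0.92, "B": 0.66, "C": -0.70}        # current utility estimates U(s')

U_s = R_S + GAMMA * sum(p * U[s2] for s2, p in T_s.items())
print(U_s)   # U(s) is determined by the utilities of its successor states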
Direct Utility Estimation
• Using the dependence to your advantage:
• Suppose you know that state (3,3) has a high utility.
• Suppose you are now at (3,2).
• The Bellman equation would be able to tell you that (3,2) is likely to have a high utility, because (3,3) is a neighbor.
• Direct utility estimation can't tell you that until the end of the trial.
• Remember that each blank state has R(s) = -0.04.
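A minimal sketch of direct utility estimation, assuming each trial is recorded as a list of (state, reward) pairs. The function name and the example trial are my own inventions (the trial is chosen so that it reproduces the 0.84 and 0.92 values used in the TD example later); note that the utilities can only be updated once the trial has ended.

from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    # Estimate U(s) as the average observed return-to-go.  The averages can
    # only be computed after a trial ends, which is why DUE cannot exploit
    # the neighbor information in the Bellman equation mid-trial.
    totals = defaultdict(float)   # sum of returns-to-go observed from each state
    counts = defaultdict(int)     # number of visits to each state
    for trial in trials:
        ret = 0.0
        for state, reward in reversed(trial):   # accumulate return-to-go backwards
            ret = reward + gamma * ret
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# A hypothetical first trial through the 4x3 grid world, ending at the +1 terminal:
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
U = direct_utility_estimation([trial])
print(round(U[(1, 3)], 2), round(U[(2, 3)], 2))   # 0.84 0.92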
Adaptive Dynamic Programming
(Model based)
• This method does take advantage of the constraints in the Bellman equation.
• Basically learns the transition model T and the reward function R.
• Based on the underlying MDP (T and R), we can perform policy evaluation (which is part of policy iteration, previously taught).
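A minimal sketch of the model-learning step, assuming the agent observes (s, a, r, s') tuples while following its policy; the class and method names here are my own, not from the slides.

from collections import defaultdict

class ADPModel:
    # Learn T(s, a, s') from transition counts and R(s) from observed rewards.
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.R = {}                                          # s -> observed reward

    def observe(self, s, a, r, s_next):
        self.R[s] = r                       # rewards are deterministic per state here
        self.counts[(s, a)][s_next] += 1

    def T(self, s, a, s_next):
        n = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / n if n else 0.0

The estimated T and R are then plugged into policy evaluation (next slides) to recompute the utilities.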
Adaptive Dynamic Programming
• Recall that policy evaluation in policy iteration involves solving the utility for each state if policy πi is followed.
• This leads to the equations:
Uπ(s) = R(s) + γ Σ_{s'} T(s, π(s), s') Uπ(s')
• The equations above are linear, so they can be solved with linear algebra in time O(n³), where n is the number of states.
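A minimal NumPy sketch of that linear solve, assuming the policy's transition matrix T_pi and reward vector R are indexed by state; the 3-state numbers are invented for illustration.

import numpy as np

def policy_evaluation(T_pi, R, gamma=0.9):
    # Solve U = R + gamma * T_pi @ U exactly, i.e. (I - gamma * T_pi) U = R.
    # T_pi[i, j] = T(s_i, pi(s_i), s_j); np.linalg.solve costs O(n^3) for n states.
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

# Invented 3-state example; gamma < 1 keeps the system non-singular because
# state 2 is absorbing.
T_pi = np.array([[0.0, 0.8, 0.2],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
R = np.array([-0.04, -0.04, 1.0])
print(policy_evaluation(T_pi, R))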
Adaptive Dynamic Programming
• Make use of policy evaluation to learn the utilities of states.
• In order to use the policy evaluation eqn:
Uπ(s) = R(s) + γ Σ_{s'} T(s, π(s), s') Uπ(s')
the ADP agent plugs in its learned estimates of T and R.
TD Learning
• Instead of doing this sum over all successors, only adjust the utility of the state based on the successor observed in the trial.
• It does not estimate the transition model – model free.
TD Learning
Example:
• Suppose you see that Uπ(1,3) = 0.84 and Uπ(2,3) =
0.92 after the first trial.
• If the transition (1,3) → (2,3) happened all the time, you would expect to see (with γ = 1):
Uπ(1,3) = R(1,3) + Uπ(2,3)
⇒ Uπ(1,3) = -0.04 + Uπ(2,3)
⇒ Uπ(1,3) = -0.04 + 0.92 = 0.88
• Since you observe Uπ(1,3) = 0.84 in the first trial, it is a little lower than 0.88, so you might want to “bump” it towards 0.88.
Temporal Difference Update
When we move from state s to s’, we apply the
following update rule:
Uπ(s) = Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))
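A minimal sketch of this update in code, reusing the numbers from the example above; alpha = 0.1 is my own choice of learning rate, not specified in the slides.

def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
    # One temporal-difference update after observing the transition s -> s_next
    # with reward R(s) = reward.
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])

U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), (2, 3), reward=-0.04)
print(round(U[(1, 3)], 3))   # 0.844 -- bumped from 0.84 towards the target 0.88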