Reinforcement Learning 3
CS 440/ECE 448
Fall 2020
Margaret Fleck
Recap
Pieces of an MDP
states s in S
actions a in A
transition probabilities P(s' | s,a)
reward function R(s)
policy π(s) returns action
When we're in state s, we command an action π(s). However, our buggy controller may put
us into any of several possible next states s', with probabilities given by the transition
function P.
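For reference, here is the general Bellman equation in standard notation, where γ is the
discount factor and U(s) is the utility of state s:

U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')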
Policy Iteration
Suppose that we have picked some policy π telling us what move to command in each state.
Then the Bellman equation for this fixed policy is simpler because we know exactly what
action we'll command:
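U(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U(s')

(Same standard notation as above; the max over actions is gone because the commanded
action is always π(s).)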
Policy iteration makes the emerging policy explicit, so it can help guide the process of
refining the utility values. It alternates two steps: policy evaluation (estimate the utility
of each state under the current policy) and policy improvement (update the policy using
those utilities).
Policy evaluation
Since we have a draft policy π(s) when doing policy evaluation, we have a simplified
Bellman equation (below).
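This is the same fixed-policy equation as above:

U(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U(s')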
With π fixed there is no max, so we get one linear equation per state. We have two options
for solving this system of equations:
linear algebra (solve the linear system exactly)
a few iterations of value iteration (with the policy held fixed)
The value estimation (iterative) approach is usually faster. We don't need an exact (fully
converged) solution, because we'll be repeating this calculation each time we refine our policy π.
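Here is a minimal Python sketch of this loop, assuming the MDP is stored as plain
dictionaries: P maps (s, a) to a list of (s', probability) pairs, R maps each state to its
reward, and gamma is the discount factor. The function names and data layout are
illustrative choices, not something fixed by the course.

def policy_evaluation(policy, P, R, states, gamma=0.9, sweeps=20):
    # Estimate U(s) = R(s) + gamma * sum_{s'} P(s'|s, policy[s]) * U(s')
    # with a few value-iteration-style sweeps, holding the policy fixed.
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U[s2] for (s2, p) in P[(s, policy[s])])
             for s in states}
    return U

def policy_improvement(U, P, states, actions):
    # In each state, pick the action with the highest expected next-state utility.
    # (R(s) and gamma don't change which action wins, so they are omitted here.)
    return {s: max(actions, key=lambda a: sum(p * U[s2] for (s2, p) in P[(s, a)]))
            for s in states}

def policy_iteration(P, R, states, actions, gamma=0.9):
    # Alternate approximate evaluation and greedy improvement until the policy is stable.
    policy = {s: actions[0] for s in states}   # arbitrary starting policy
    while True:
        U = policy_evaluation(policy, P, R, states, gamma)
        new_policy = policy_improvement(U, P, states, actions)
        if new_policy == policy:
            return policy, U
        policy = new_policy

Because each call to policy_evaluation runs only a few sweeps, the utility estimates are
approximate, which is exactly why an exact (fully converged) solution is unnecessary here.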