
CS 440/ECE 448, Fall 2020
Reinforcement Learning 3
Margaret Fleck

Recap
Pieces of an MDP

states s ∈ S
actions a ∈ A
transition probabilities P(s′ | s, a)
reward function R(s)
policy π(s), which returns an action

When we're in state s, we command the action π(s). However, our buggy controller may land
us in any of several next states s′, with probabilities given by the transition function P.

Bellman equation for optimal policy

U(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\, U(s')

Recap: Value iteration


Recall how we solve the Bellman equation using value iteration. Let U_t be the utility values
at iteration step t.

Initialize U_0(s) = 0 for all states s.

For t = 0 until the values converge, update U using the equation

U_{t+1}(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\, U_t(s')

Then extract the corresponding policy:

\pi(s) = \operatorname{argmax}_a \sum_{s'} P(s' \mid s, a)\, U(s')


Value iteration eventually converges to the solution. Notice that the optimal utility values
are uniquely determined, but there may be more than one policy consistent with them.
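
As a concrete illustration, here is a minimal Python sketch of value iteration and policy extraction, using the dictionary encoding of the toy MDP above; the function names and the tolerance-based convergence test are assumptions made for this sketch, not part of the lecture.

    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        """Iterate the Bellman update until the utilities stop changing (within tol)."""
        U = {s: 0.0 for s in states}                      # U_0(s) = 0 for all s
        while True:
            U_next = {}
            for s in states:
                best = max(sum(p * U[s2] for s2, p in P[s][a].items())
                           for a in actions)
                U_next[s] = R[s] + gamma * best           # Bellman update
            if max(abs(U_next[s] - U[s]) for s in states) < tol:
                return U_next
            U = U_next

    def extract_policy(states, actions, P, U):
        """Greedy policy: pi(s) = argmax_a of the expected utility of the next state."""
        return {s: max(actions,
                       key=lambda a: sum(p * U[s2] for s2, p in P[s][a].items()))
                for s in states}

Running value_iteration on the toy MDP and then extract_policy on the resulting utilities reproduces the two steps described above.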

Policy Iteration
Suppose that we have picked some policy π telling us what move to command in each state.

Then the Bellman equation for this fixed policy is simpler because we know exactly what
action we'll command:

Bellman equation for a fixed policy:


U(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, U(s')

Because the optimal policy is tightly coupled to the correct utility values, we can rephrase
our optimization problem as finding the best policy. This is "policy iteration". It produces the
same solution as value iteration, but often converges faster.

Specifically, the policy iteration algorithm looks like this:

Start with an initial guess for policy π.


Alternate two steps:
Policy evaluation: use policy π to estimate utility values U
Policy improvement: use utility values U to calculate a new policy π

Policy iteration makes the emerging policy values explicit, so they can help guide the
process of refining the utility values.

The policy improvement step is easy. Just use this equation:

\pi(s) = \operatorname{argmax}_a \sum_{s'} P(s' \mid s, a)\, U(s')


We still need to understand how to do the policy evaluation step.

Policy evaluation
Since we have a draft policy π(s) when doing policy evaluation, we have a simplified
Bellman equation (below).

U(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U(s')


We have one of these equations for each state s. The equations are still recursive (like the
original Bellman equation) but they are now linear. So we have two options for adjusting our
utility function:

linear algebra
a few iterations of value iteration

The value iteration approach is usually faster. We don't need an exact (fully converged)
solution, because we'll be repeating this calculation each time we refine our policy π.
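
To make the two options concrete, here is a sketch of exact policy evaluation by linear algebra (solving (I − γP_π)U = R with NumPy) together with the surrounding policy iteration loop. It again assumes the toy dictionary encoding from above, and the helper names are made up for this sketch.

    import numpy as np

    def policy_evaluation_exact(states, P, R, pi, gamma=0.9):
        """Solve the fixed-policy Bellman equation exactly: (I - gamma * P_pi) U = R."""
        n = len(states)
        index = {s: i for i, s in enumerate(states)}
        P_pi = np.zeros((n, n))                           # transition matrix under pi
        for s in states:
            for s2, p in P[s][pi[s]].items():
                P_pi[index[s], index[s2]] = p
        r = np.array([R[s] for s in states])
        U = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
        return {s: U[index[s]] for s in states}

    def policy_iteration(states, actions, P, R, gamma=0.9):
        """Alternate evaluation and greedy improvement until the policy stops changing."""
        pi = {s: actions[0] for s in states}              # arbitrary initial policy
        while True:
            U = policy_evaluation_exact(states, P, R, pi, gamma)
            new_pi = {s: max(actions,
                             key=lambda a: sum(p * U[s2] for s2, p in P[s][a].items()))
                      for s in states}
            if new_pi == pi:
                return pi, U
            pi = new_pi

Swapping policy_evaluation_exact for a few sweeps of the fixed-policy Bellman update gives the approximate, value-iteration-style variant described above.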

Asynchronous dynamic programming

One useful tweak to solving Markov Decision Processes is "asynchronous dynamic
programming." In each iteration, it's not necessary to update all states. We can select only
certain states for updating. E.g.

states frequently seen in some application (e.g. a game)


states for which the Bellman equation has a large error (i.e. compare values for left
and right sides of the equation)

The details can be spelled out in a wide variety of ways.
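
As one possible illustration of the second bullet (an assumption about how this could look in code, not the lecture's own implementation), here is a sweep that updates only the states whose Bellman error currently exceeds a threshold, again using the toy dictionary encoding from above:

    def bellman_error(s, U, actions, P, R, gamma):
        """Absolute difference between the two sides of the Bellman equation at s."""
        best = max(sum(p * U[s2] for s2, p in P[s][a].items()) for a in actions)
        return abs(U[s] - (R[s] + gamma * best))

    def asynchronous_sweep(states, actions, P, R, U, gamma=0.9, threshold=1e-3):
        """Update only the states whose Bellman error is currently large."""
        for s in states:
            if bellman_error(s, U, actions, P, R, gamma) > threshold:
                best = max(sum(p * U[s2] for s2, p in P[s][a].items()) for a in actions)
                U[s] = R[s] + gamma * best    # updated in place, so later states see it
        return U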
