11 ML Reinforcement Learning Prediction

The document discusses Markov Reward Processes (MRPs) and their relation to Markov Decision Processes (MDPs), with emphasis on value estimation methods such as Monte-Carlo, Dynamic Programming, and Temporal Difference. It explains the Bellman operator and its contractive nature, which guarantees convergence to the value function under iterative application. It also highlights the advantages of Temporal Difference: lower variance than Monte-Carlo and, compared with Dynamic Programming, sample-based estimation that extends (with function approximation) to continuous state-action spaces.


Overview

1. MRP vs MDP
2. Value Function Estimators

Monte-Carlo Estimation
Dynamic Programming
Temporal Difference

Markov Reward Processes


Markov reward processes are composed of (formalized compactly below):
A set of states
A transition function
A reward function
A discount factor
A starting-state distribution
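In compact notation (a standard formalization, not taken verbatim from the slides), an MRP is the tuple

\[
\mathcal{M} = (\mathcal{S},\, P,\, R,\, \gamma,\, \mu_0), \qquad
P(s' \mid s), \quad R(s), \quad \gamma \in [0, 1), \quad \mu_0(s).
\]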

Markov Reward Process


Given a fixed policy $\pi$, an MDP can be seen as an MRP in which the reward and transition
functions are averaged under $\pi$ (see the sketch below).

Notice that the value of each state in the MRP is equivalent to the value in the original MDP
for the fixed policy $\pi$.

MRPs offer a convenient vector formalization for computing $V^\pi$.
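A standard way to write this reduction (assuming the usual MDP notation $R(s, a)$ and $P(s' \mid s, a)$; reconstructed, not copied from the slides):

\[
r^\pi(s) = \sum_{a} \pi(a \mid s)\, R(s, a),
\qquad
P^\pi(s' \mid s) = \sum_{a} \pi(a \mid s)\, P(s' \mid s, a).
\]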

Investor MRP

Markov Reward Processes


In the following, we will work a lot with MRPs and the associated vector representation, and
show results about $V^\pi$.
MRPs simplify the theory by hiding the effect of the "decision".
However, all reasoning made for MRPs is substantially identical for MDPs and $Q$-functions;
only the vector notation is more complicated.

How to estimate the average return of each state?

Value Estimation
Main (sample-based) estimators:
Monte-Carlo
Temporal Difference
Temporal Difference with Function Approximation (next lecture)
To understand temporal difference, we first need to understand:

the closed-form solution
dynamic programming
bootstrapping
online empirical averages

Monte-Carlo Estimation
Main Objective: Estimate $V^\pi$ and/or $Q^\pi$.
Reminder:
$V^\pi(s)$ is the average discounted return of the policy $\pi$ starting from state $s$.
$Q^\pi(s, a)$ is the average discounted return of the policy $\pi$ starting from state $s$ and
action $a$.

Monte-Carlo Estimation
Main Idea: We can use the definition of $V$- and $Q$-values to derive a maximum-likelihood,
unbiased estimator.
We "run" the MDP starting from state $s$ several times, and average the returns. Each run is called an
episode.

Monte-Carlo Estimation
In practice, due to limited time, we need to truncate the episodes, which introduces a (small) bias.
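A minimal Monte-Carlo sketch in Python, assuming an illustrative environment interface in which `env.reset()` returns the initial state and `env.step(a)` returns `(next_state, reward, done)`, and a `policy(s)` function; these names are not from the slides. Episodes are truncated at `horizon` steps, which causes the small bias mentioned above.

import numpy as np

def mc_value_estimate(env, policy, gamma, n_episodes=1000, horizon=200):
    # Monte-Carlo estimate of V(s0): empirical average of (truncated) discounted returns.
    returns = []
    for _ in range(n_episodes):
        s = env.reset()                  # each episode starts from the start-state distribution
        g, discount = 0.0, 1.0
        for _ in range(horizon):         # truncation introduces the small bias discussed above
            a = policy(s)
            s, r, done = env.step(a)
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append(g)
    return np.mean(returns)              # empirical average of the returns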

Monte-Carlo Estimation

Properties:
Maximum likelihood. We have seen many times that the empirical average is a
maximum-likelihood estimator of the expected value.
(Almost) Unbiased. The empirical average is unbiased; a small bias is introduced by truncation.
High Variance. Monte-Carlo estimators suffer from high variance.
Works also for continuous MDPs.
Works also when the Markov assumption is not met.
Advanced question: How can I estimate the value of a policy from data collected with another policy?
This is called off-policy evaluation (see Sutton 103).

Beyond Monte-Carlo Estimation


Monte-Carlo has too high variance. Can we do better?

Bellman Equations for Policy Evaluation

Scalar Bellman Equations

Vector Bellman Equations

(the vector equation for $Q^\pi$ is very similar, but complicated by some notational
details)
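For reference, the standard scalar and vector forms for the MRP induced by $\pi$ (reconstructed in the notation above, not copied from the slides):

\[
V^\pi(s) = r^\pi(s) + \gamma \sum_{s'} P^\pi(s' \mid s)\, V^\pi(s'),
\qquad
\mathbf{v}^\pi = \mathbf{r}^\pi + \gamma P^\pi \mathbf{v}^\pi .
\]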

Closed-Form Solution of the Bellman Equation

Closed-Form Solution

Advanced Question (Extra Chocolate!): Verify that the value function is the sum of
discounted rewards, starting from the closed-form solution and using the Neumann series.
Send me an email before the next lecture.
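A sketch of the standard closed form, $\mathbf{v}^\pi = (I - \gamma P^\pi)^{-1} \mathbf{r}^\pi$, computed with NumPy. The policy-averaged reward vector `r` and transition matrix `P` are assumed given; the names are illustrative.

import numpy as np

def closed_form_value(P, r, gamma):
    # Solve v = r + gamma * P v exactly, i.e. v = (I - gamma P)^{-1} r.
    # Solving the linear system is preferred to forming the inverse explicitly.
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)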

Why Dynamic Programming


The closed-form solution requires
discrete state-action space
knowledge of the model
matrix inversion (computationally intensive)
Dynamic programming still requires a discrete state-action space and knowledge of the
model, BUT

it is an approximation (it allows trading off computation for precision)

it sets up the mathematical framework that will allow us to perform model-free
estimation in continuous state-action spaces
Understanding dynamic programming is super important!

A Perspective on Operators
The whole idea of dynamic programming is based on an operator, called the Bellman operator,
which takes as input a vector and returns another vector:

When we repeatedly apply the Bellman operator to a vector, we converge to $\mathbf{v}^\pi$, i.e.,

To understand how this math works, we first need to understand the Banach Theorem.

Banach's Theorem
Banach's Theorem: Consider a vector space equipped with a metric $d$. Consider an operator
$T: \mathbb{R}^n \rightarrow \mathbb{R}^n$, where $n$ is the dimension of the vector space.
If the operator is contractive w.r.t. $d$ (in the sense displayed below),

with contraction factor $L < 1$,
then $T$ has a unique fixed point, and iterating $T$ from any starting vector converges to it,

where $T^k$ denotes the iterative application of $T$ ($k$ times).
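In display form (a standard statement of the theorem, using the symbols above; $\mathbf{v}^*$ denotes the fixed point):

\[
d(T\mathbf{v},\, T\mathbf{u}) \le L\, d(\mathbf{v}, \mathbf{u}) \;\;\forall\, \mathbf{v}, \mathbf{u}, \quad L < 1
\;\;\Longrightarrow\;\;
T\mathbf{v}^* = \mathbf{v}^* \text{ (unique)}, \qquad
\lim_{k \to \infty} T^{k}\mathbf{v}_0 = \mathbf{v}^* \;\;\forall\, \mathbf{v}_0 .
\]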

The Bellman Operator and Its Contractivity


Consider the Bellman operator:

$T^\pi \mathbf{v} = \mathbf{r}^\pi + \gamma P^\pi \mathbf{v}$ (vector definition)

The Bellman operator is contractive under the $\infty$-norm, which means,

$\| T^\pi \mathbf{v} - T^\pi \mathbf{u} \|_\infty \le \gamma \| \mathbf{v} - \mathbf{u} \|_\infty .$
Proof of Contractivity
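A standard one-line argument (a sketch, assuming the vector definition $T^\pi \mathbf{v} = \mathbf{r}^\pi + \gamma P^\pi \mathbf{v}$ and that $P^\pi$ is row-stochastic):

\[
\| T^\pi \mathbf{v} - T^\pi \mathbf{u} \|_\infty
= \gamma \| P^\pi (\mathbf{v} - \mathbf{u}) \|_\infty
\le \gamma \max_s \sum_{s'} P^\pi(s' \mid s)\, | \mathbf{v}(s') - \mathbf{u}(s') |
\le \gamma \| \mathbf{v} - \mathbf{u} \|_\infty .
\]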

Dynamic Programming
The (unique) fixed point of the Bellman operator satisfies the Bellman equation,
$T^\pi \mathbf{v} = \mathbf{v}$; thus, the fixed point of the Bellman operator is the value function $\mathbf{v}^\pi$.

Thanks to the Banach theorem, we can state that the iterated application of the
Bellman operator to any vector converges to the value function $\mathbf{v}^\pi$, i.e.,
$\lim_{k \to \infty} (T^\pi)^k \mathbf{v}_0 = \mathbf{v}^\pi$.

Tabular Dynamic Programming

(Tabular) DP: Value Function - Scalar Definition


1: Input: $\pi$, the policy to be evaluated
2: Initialize:
3: for do

4: for each do
5:

6: end for
7: end for

8: Return
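A minimal Python sketch of the scalar algorithm above, assuming a finite MRP given by arrays `P[s, s']` and `r[s]` (the policy-averaged model; names are illustrative, not from the slides). It keeps two tables, `V` and `V_new`, as in classic DP.

import numpy as np

def dp_policy_evaluation(P, r, gamma, n_iters=100):
    # Repeatedly apply the Bellman operator with two tables:
    # V_new(s) = r(s) + gamma * sum_s' P(s'|s) * V(s')
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iters):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = r[s] + gamma * np.dot(P[s], V)
        V = V_new                        # swap tables after a full sweep
    return V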

Tabular Dynamic Programming


(Tabular) DP: Value Function - Vector Definition
1: Input: $\pi$, the policy to be evaluated
2: Initialize:

3: for do
4:
5: end for

6: Return
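The vector version is a one-line update per iteration; a sketch with the same assumed `P` and `r` as above:

import numpy as np

def dp_policy_evaluation_vec(P, r, gamma, n_iters=100):
    # Vector form of the same iteration: v <- r + gamma * P @ v
    v = np.zeros(P.shape[0])
    for _ in range(n_iters):
        v = r + gamma * P @ v
    return v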

Tabular Dynamic Programming

(Tabular) DP: Q-Function - Scalar Definition


1: Input: $\pi$, the policy to be evaluated
2: Initialize:
3: for do

4: for each do
5:

6: end for

7: end for
8: Return

Summary of Dynamic Programming


The closed-form solution requires a fixed amount of computation. With dynamic
programming we can decide how much computation to use to approximate the value
function. In the limit of infinite computation, we will get the true value function, but with
little computation, we can usually still obtain a very good approximation.
However, we have not yet unlocked the whole potential of the recursion induced by the Bellman
equation. Dynamic programming is still model-based (it requires a model of rewards and
transitions), therefore it is not sample-based, and it requires a discrete state-action space.

Summary of Dynamic Programming


Temporal-difference is a sample-based estimator. Temporal-difference can be seen as a
modification of dynamic programming where the value functions are estimated with
bootstrapping and an online average estimator.

Both dynamic programming and temporal-difference can be modified by introducing
function approximation, which allows working with continuous state-action spaces.

Main Idea of Temporal Difference


The main idea of temporal difference is to use samples from the environment (MDP) to
continuously improve the value estimate.
1: Input: , number of episodes
2: A table of values initialized with zeros

3: for Episodes do
4: Sample first state

5: for Single episode do


6: Sample action
7: Apply on the environment and receive reward and next state

8: Use to update the value function

9:
10: end for

11: end for


To do that, we need two ingredients: bootstrapping and online empirical averages.

Bootstrapping
In dynamic programming, we needed two tables, $V_k$ and $V_{k+1}$: we read from $V_k$ and updated $V_{k+1}$.
The idea of bootstrapping is to have only one table, and to update each cell based on the
current values in that same table. Bootstrapping is more efficient, since it continuously improves
the value estimate.

(Tabular) Dynamic Programming with Bootstrapping


(Online Sampling)
1: Input: , number of episodes

2: A table of values initialized with zeros


3: for Episodes do
4: Sample first state

5: for Single episode do


6: Sample action
7:

8:

9: end for
10: end for

Online Empirical Average


In dynamic programming, the model is used to compute an average (an expectation over next states).

Main Idea: We can estimate the Bellman update by using online empirical averages.
Online Empirical Average. The empirical average can be computed online:

1: Input:

2: Initialize:
3: for do

4:

5:
6: end for

Temporal Difference
Equivalently,

1: Input:
2: Initialize:
3: for do

4:

5:
6:
7: end for

We call $\alpha$ the learning rate.
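A sketch of the online empirical average and its equivalent learning-rate form (illustrative Python; `samples` is any iterable of numbers, and the function names are assumptions, not from the slides):

def online_mean(samples):
    # Running mean: mu_n = mu_{n-1} + (1/n) * (x_n - mu_{n-1})
    mu, n = 0.0, 0
    for x in samples:
        n += 1
        mu += (1.0 / n) * (x - mu)
    return mu

def online_mean_constant_alpha(samples, alpha=0.1):
    # Same update with a constant learning rate: an exponentially weighted (recency-biased) average.
    mu = 0.0
    for x in samples:
        mu += alpha * (x - mu)
    return mu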

Temporal Difference
We can use the online averaging in place of the exact expectation of dynamic
programming, for both the $V$-function and the $Q$-function; the resulting updates are shown
below.
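The resulting updates, written in standard TD(0) form consistent with the pseudocode that follows (reconstructed, not copied from the slides): for a sampled transition with reward $r$ and next state $s'$ (and next action $a' \sim \pi(\cdot \mid s')$ for the $Q$ version),

\[
V(s) \leftarrow V(s) + \alpha \big( r + \gamma V(s') - V(s) \big),
\qquad
Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma Q(s', a') - Q(s, a) \big).
\]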

Temporal Difference
Main Idea: by unifying bootstrapping and online empirical averages, we obtain the temporal-
difference algorithm.

(Tabular) Temporal Difference for Value Functions


1: Input: , number of episodes
2: Initialize: a table of state visitations, a table of values initialized with zeros

3: for Episodes do
4: Sample first state
5: for Single episode do

6: Sample action
7:

8: Apply on the environment and receive reward and next state

9:
10:
11: end for

12: end for
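A tabular TD(0) sketch in Python for the algorithm above, again using the illustrative `env.reset()` / `env.step(a)` / `policy(s)` interface introduced earlier (states are assumed to be integer indices), with the state-dependent learning rate $1/n(s)$ from the pseudocode:

import numpy as np

def td0_value_estimation(env, policy, gamma, n_states, n_episodes=1000, horizon=200):
    V = np.zeros(n_states)        # table of values, initialized with zeros
    n = np.zeros(n_states)        # table of state visitations
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(horizon):
            a = policy(s)
            s_next, r, done = env.step(a)
            n[s] += 1
            alpha = 1.0 / n[s]    # state-dependent learning rate
            target = r + gamma * (0.0 if done else V[s_next])
            V[s] += alpha * (target - V[s])     # TD(0) update
            s = s_next
            if done:
                break
    return V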

Temporal Difference

(Tabular) Temporal Difference for Q-Functions


1: Input: , number of episodes
2: Initialize: a table of state-action visitations, a table of $Q$-values
initialized with zeros
3: for Episodes do

4: Sample first state and


5: for Single episode do
6:

7: Apply a on the environment and receive reward and next state


8: Sample action

9:
10:

11: end for


12: end for
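The $Q$-function version follows the same pattern (a SARSA-style evaluation sketch under the same assumed interface; states and actions are integer indices):

import numpy as np

def td0_q_estimation(env, policy, gamma, n_states, n_actions, n_episodes=1000, horizon=200):
    Q = np.zeros((n_states, n_actions))
    n = np.zeros((n_states, n_actions))       # state-action visitation counts
    for _ in range(n_episodes):
        s = env.reset()
        a = policy(s)                          # sample the first action as well
        for _ in range(horizon):
            s_next, r, done = env.step(a)
            a_next = policy(s_next)            # next action drawn from the same policy (on-policy)
            n[s, a] += 1
            alpha = 1.0 / n[s, a]
            target = r + gamma * (0.0 if done else Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
            if done:
                break
    return Q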

Learning Rate
In the presented algorithms, the learning rate is state(-action) dependent. Keeping the learning
rate state-action dependent is an optimal choice, but in practical implementations it is a global
variable, independent of states and actions. One of the reasons is that in continuous state-action
spaces, estimating the state(-action) visitation counts is difficult.
Having a "global" learning rate that decreases in time is also inconvenient: states
that are far from the initial state will be updated only with a low learning rate.

Most state-of-the-art algorithms use a global, constant learning rate fixed a priori.

Temporal Difference
In the limit of infinite visitation of each state-action pair, temporal difference is consistent
(i.e., it converges to the true value function or $Q$-function with probability one).
Temporal difference has drastically lower variance than Monte-Carlo. However, when
combined with function approximation it becomes biased.
Temporal difference is the foundation of all value-based and actor-critic methods. Recent
research shows that even policy gradients can be understood in terms of temporal
difference (Tosatto et al., 2021).

Unsupervised Validation Set


Some open questions to check your own understanding:

Name three estimators of the value function


Write the Bellman equation and name its components
What is the Bellman operator? What does it mean that it is contractive?
Explain the Banach theorem and its implications for the understanding of dynamic
programming.
Explain dynamic programming.
Explain bootstrapping. Why is DP+Bootstrapping more computationally efficient
than DP?
Explain online averages.
What are the advantages of TD w.r.t. DP?
Describe the temporal-difference error.
Describe the temporal-difference algorithm.

