11 ML Reinforcement Learning Prediction
1. MRP vs MDP
2. Value Function Estimators
Monte-Carlo Estimation
Dynamic Programming
Temporal Difference
Notice that the value of each state in the MRP is equivalent to the value in the original MDP for the fixed policy $\pi$.
Investor MRP
How to estimate the average return of each state?
Value Estimation
Main (sample-based) estimators:
Monte-Carlo
Temporal Difference
Temporal Difference with Function Approximation (next lecture)
To understand temporal difference, we first need to understand Monte-Carlo estimation.
Monte-Carlo Estimation
Main Objective: Estimate $V^\pi(s)$ and/or $Q^\pi(s, a)$.
Reminder:
$V^\pi(s)$ is the average discounted return of the policy $\pi$ starting from state $s$.
$Q^\pi(s, a)$ is the average discounted return of the policy $\pi$ starting from state $s$ and action $a$.
Monte-Carlo Estimation
Main Idea: We can use the definition of $V$- and $Q$-values to derive a maximum-likelihood, unbiased estimator.
We "run" the MDP starting from state $s$ a number of times and average the returns. Each run is called an episode.
Monte-Carlo Estimation
In practice, due to limited time, we need to truncate the episodes, which introduces a (small) bias.
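As an illustration, here is a minimal Monte-Carlo sketch in Python. It assumes a simple environment interface with reset()/step() methods and a policy(state) function; these names and signatures are assumptions for the sketch, not part of the lecture.

```python
import numpy as np

def mc_value_estimate(env, policy, start_state, gamma=0.95,
                      n_episodes=1000, horizon=200):
    """Monte-Carlo estimate of V^pi(start_state).

    Runs n_episodes truncated episodes from start_state, computes the
    discounted return of each, and averages them (the empirical mean
    is the almost-unbiased estimator described above).
    """
    returns = []
    for _ in range(n_episodes):
        state = env.reset(start_state)      # assumed: env can restart in a chosen state
        g, discount = 0.0, 1.0
        for _ in range(horizon):            # truncation introduces the small bias
            action = policy(state)
            state, reward, done = env.step(action)  # assumed step() signature
            g += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(g)
    return np.mean(returns)
```

Averaging over more episodes reduces the (high) variance of the estimate; a longer horizon reduces the truncation bias.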
Monte-Carlo Estimation
Properties:
Maximum likelihood. We have seen many times that the empirical average is a maximum-likelihood estimator of the expected value.
(Almost) Unbiased. The empirical average is unbiased; the only bias comes from truncation.
High Variance. Monte-Carlo estimators suffer from high variance.
Works also for continuous MDPs.
Works also when the Markov assumption is not met.
Advanced question: How can I estimate $V^\pi$ when the data were collected with another policy? This is called off-policy evaluation (see Sutton 103).
Bellman Equations for Policy Evaluation
$V^\pi(s) = \sum_a \pi(a \mid s)\big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^\pi(s')\big]$,
or, in vector form for the MRP induced by $\pi$,
$V^\pi = R^\pi + \gamma P^\pi V^\pi.$
(the vector equation for $Q^\pi$ is very similar, but complicated by some mathematical notation details)
Closed-Form Solution
The vector Bellman equation is linear in $V^\pi$, so it can be solved in closed form:
$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi.$
Advanced Question (Extra Chocolate!): Verify that the value function is the sum of discounted rewards, starting from the closed-form solution and using the Neumann series. Send me an email before the next lecture.
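The following sketch checks the closed-form solution on a small, made-up MRP and compares it with a truncated Neumann series; the reward vector and transition matrix below are illustrative, not from the lecture.

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0, 2.0])               # illustrative expected reward per state
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])              # illustrative row-stochastic transitions

# Closed form: V = (I - gamma P)^{-1} R
V_closed = np.linalg.solve(np.eye(3) - gamma * P, R)

# Truncated Neumann series: sum_t (gamma P)^t R, i.e. the sum of discounted rewards
V_series = sum(np.linalg.matrix_power(gamma * P, t) @ R for t in range(500))

print(V_closed)
print(np.max(np.abs(V_closed - V_series)))   # ~0 up to truncation error
```

This numerical agreement is exactly the relation the advanced question asks you to prove analytically with the Neumann series.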
A Perspective on Operators
The whole idea of dynamic programming is based on an operator, called the Bellman operator, that takes as input a vector and returns another vector:
$T^\pi V = R^\pi + \gamma P^\pi V.$
When we repeatedly apply the Bellman operator to a vector, we converge to $V^\pi$, i.e.,
$\lim_{n \to \infty} (T^\pi)^n V = V^\pi.$
To understand how this math works, we first need to understand the Banach theorem.
Banach's Theorem
Banach's Theorem: Consider a vector space $\mathbb{R}^d$ equipped with a metric $\rho$. Consider an operator $T: \mathbb{R}^d \to \mathbb{R}^d$, where $d$ is the dimension of the vector space.
If the operator is contractive w.r.t. $\rho$, i.e.,
$\rho(T x, T y) \le L\, \rho(x, y) \quad \forall x, y,$
with $L < 1$,
then $T$ has a unique fixed point $x^*$, and
$\lim_{k \to \infty} T^k x = x^* \quad \forall x,$
where $T^k$ denotes the iterative application of $T$.
The Bellman operator (vector definition),
$T^\pi V = R^\pi + \gamma P^\pi V,$
is contractive under the $\infty$-norm, which means
$\|T^\pi V - T^\pi U\|_\infty \le \gamma\, \|V - U\|_\infty.$
Proof of Contractivity
$\|T^\pi V - T^\pi U\|_\infty = \gamma\, \|P^\pi (V - U)\|_\infty \le \gamma\, \|V - U\|_\infty,$
where the inequality holds because $P^\pi$ is a stochastic matrix (non-negative rows summing to one), so each entry of $P^\pi (V - U)$ is a convex combination of the entries of $V - U$.
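A quick numerical illustration of the contraction property on a randomly generated MRP; the matrices are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 4
R = rng.normal(size=n)
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # row-stochastic transition matrix

def T(V):
    """Bellman operator of the induced MRP: T V = R + gamma P V."""
    return R + gamma * P @ V

V, U = rng.normal(size=n), rng.normal(size=n)
lhs = np.max(np.abs(T(V) - T(U)))            # ||T V - T U||_inf
rhs = gamma * np.max(np.abs(V - U))          # gamma ||V - U||_inf
print(lhs <= rhs)                            # True: the operator is contractive
```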
Dynamic Programming
The (unique) fixed point of the Bellman operator satisfies the Bellman equation,
$T^\pi V = V \iff V = R^\pi + \gamma P^\pi V,$
thus the fixed point of the Bellman operator is the value function $V^\pi$.
Thanks to the Banach theorem, we can state that the iterated application of the Bellman operator to any vector $V_0$ converges to the value function $V^\pi$, i.e.,
$\lim_{n \to \infty} (T^\pi)^n V_0 = V^\pi.$
Tabular Dynamic Programming
1: Input: policy $\pi$, model $r(s, a)$ and $p(s' \mid s, a)$, discount $\gamma$, iterations $N$
2: Initialize: $V(s) \leftarrow 0$ for all $s$
3: for $n = 1, \dots, N$ do
4: for each $s \in \mathcal{S}$ do
5: $V'(s) \leftarrow \sum_a \pi(a \mid s)\big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V(s')\big]$
6: end for
7: $V \leftarrow V'$
8: end for
9: Return $V$

In vector form, for the MRP induced by $\pi$:
1: Input: $R^\pi$, $P^\pi$, discount $\gamma$, iterations $N$
2: Initialize: $V_0 \leftarrow \mathbf{0}$
3: for $n = 1, \dots, N$ do
4: $V_n \leftarrow R^\pi + \gamma P^\pi V_{n-1}$
5: end for
6: Return $V_N$
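A minimal Python version of the two-table sweep above, assuming the policy has already been folded into the induced MRP; the argument names R_pi and P_pi are mine, not the lecture's.

```python
import numpy as np

def tabular_policy_evaluation(R_pi, P_pi, gamma=0.9, n_iter=1000):
    """Tabular dynamic programming for policy evaluation.

    R_pi: shape (S,) expected reward per state under the fixed policy.
    P_pi: shape (S, S) transition matrix of the induced MRP.
    Uses two tables, V and V_new, as in the listing above.
    """
    n_states = len(R_pi)
    V = np.zeros(n_states)
    for _ in range(n_iter):
        V_new = np.empty(n_states)
        for s in range(n_states):                    # sweep over all states
            V_new[s] = R_pi[s] + gamma * P_pi[s] @ V
        V = V_new                                    # swap the tables
    return V
```

By the Banach argument above, the returned vector approaches the closed-form solution $(I - \gamma P^\pi)^{-1} R^\pi$ as the number of iterations grows.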
Bootstrapping
In dynamic programming, we needed two tables, $V$ and $V'$, and we updated $V'$ while reading from $V$.
The idea of bootstrapping is to keep only one table and to update each cell based on the current values of that same table. Bootstrapping is more efficient, since it continuously improves the value estimate.
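A sketch of the bootstrapped (one-table) variant, using the same assumed (R_pi, P_pi) representation as before.

```python
import numpy as np

def in_place_policy_evaluation(R_pi, P_pi, gamma=0.9, n_iter=1000):
    """Bootstrapped (in-place) policy evaluation with a single table V.

    Each cell is overwritten using the current entries of the same table,
    so later states in a sweep already see the freshly updated values.
    """
    V = np.zeros(len(R_pi))
    for _ in range(n_iter):
        for s in range(len(R_pi)):
            V[s] = R_pi[s] + gamma * P_pi[s] @ V     # reads partially updated V
    return V
```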
Main Idea: We can estimate the Bellman update by using online empirical averages.
Online Empirical Average. The empirical mean $\mu_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ can be computed online:
1: Input: stream of samples $x_1, x_2, \dots$
2: Initialize: $\mu_0 \leftarrow 0$
3: for $k = 1, 2, \dots$ do
4: receive sample $x_k$
5: $\mu_k \leftarrow \mu_{k-1} + \frac{1}{k}\,(x_k - \mu_{k-1})$
6: end for
Temporal Difference
Equivalently, with a learning rate $\alpha_k = 1/k$,
1: Input: stream of samples $x_1, x_2, \dots$
2: Initialize: $\mu_0 \leftarrow 0$
3: for $k = 1, 2, \dots$ do
4: receive sample $x_k$
5: $\alpha_k \leftarrow 1/k$
6: $\mu_k \leftarrow (1 - \alpha_k)\,\mu_{k-1} + \alpha_k\, x_k$
7: end for
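A tiny check that both incremental forms reproduce the batch empirical average; the sample stream below is synthetic.

```python
import numpy as np

x = np.random.default_rng(1).normal(size=1000)    # synthetic sample stream

mu = 0.0
for k, xk in enumerate(x, start=1):
    mu += (xk - mu) / k                           # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k

nu = 0.0
for k, xk in enumerate(x, start=1):
    alpha = 1.0 / k                               # learning-rate view, alpha_k = 1/k
    nu = (1 - alpha) * nu + alpha * xk

print(np.isclose(mu, x.mean()), np.isclose(nu, x.mean()))  # True True
```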
Temporal Difference
We can use the online averaging in place of the exact expectation of dynamic programming, i.e.,
$V(s_t) \leftarrow V(s_t) + \alpha_t\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$
and
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t\big(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big),$
where $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ and $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$.
Temporal Difference
Main Idea: unifying bootstrapping and online empirical averages, we obtain the temporal-difference algorithm.
1: Input: policy $\pi$, discount $\gamma$, learning rate $\alpha$
2: Initialize: $V(s) \leftarrow 0$ for all $s$
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
6: Sample action $a \sim \pi(\cdot \mid s)$
7: Observe reward $r$ and next state $s'$
8: $\delta \leftarrow r + \gamma V(s') - V(s)$
9: $V(s) \leftarrow V(s) + \alpha\, \delta$
10: $s \leftarrow s'$
11: end for
12: end for
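A minimal TD(0) sketch for estimating $V^\pi$ in Python, again assuming a simple reset()/step() environment interface and a policy(state) function (both assumptions for the sketch); the constant learning rate anticipates the discussion below.

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma=0.95, alpha=0.05,
                   n_episodes=5000, horizon=200):
    """TD(0) policy evaluation: bootstrapping + online averaging.

    Assumes discrete (hashable) states; V is stored in a dictionary.
    The env interface (reset()/step()) is an assumption, not from the lecture.
    """
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        for _ in range(horizon):
            action = policy(state)
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])     # TD update with error delta
            state = next_state
            if done:
                break
    return dict(V)
```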
Learning Rate
In the presented algorithms, the learning rate $\alpha$ is state(-action) dependent. Keeping $\alpha$ state(-action) dependent is an optimal choice in principle, but in practical implementations $\alpha$ is a global variable, independent of states and actions. One of the reasons is that in continuous state-action spaces, estimating the visitation count of each state-action pair is difficult.
Having a "global" learning rate that decreases over time is also inconvenient: states that are far from the initial state will be updated only with a low learning rate.
Most state-of-the-art algorithms use a global, constant learning rate fixed a priori.
Temporal Difference
In the limit of infinite visitation of each state-action pair, temporal difference is consistent (i.e., it converges to the true value function or $Q$-function with probability one).
Temporal difference has drastically lower variance than Monte-Carlo. However, when combined with function approximation it becomes biased.
Temporal difference is the foundation of all value-based and actor-critic methods. Recent research shows that even policy gradients can be understood in terms of temporal difference (Tosatto et al., 2021).