11 ML Reinforcement Learning Prediction
1. MRP vs MDP
2. Value Function Estimators
Monte-Carlo Estimation
Dynamic Programming
Temporal Difference
Notice that the value of each state in the MRP is equivalent to the value in the original MDP for the fixed policy $\pi$.
Investor MRP
How to estimate the average return of each state?
Value Estimation
Main (sample-based) estimators:
Monte-Carlo
Temporal Difference
Temporal Difference with Function Approximation (next lecture)
To understand temporal difference, we first need to understand Monte-Carlo estimation.
Monte-Carlo Estimation
Main Objective: Estimate $V^\pi(s)$ and/or $Q^\pi(s, a)$.
Reminder:
$V^\pi(s)$ is the average discounted return of the policy $\pi$ starting from state $s$.
$Q^\pi(s, a)$ is the average discounted return of the policy $\pi$ starting from state $s$ and action $a$.
Monte-Carlo Estimation
Main Idea: We can use the definition of $V$- and $Q$-values to derive a maximum-likelihood, unbiased estimator.
We "run" the MDP starting from state $s$ a number of times and average the returns. Each run is called an episode.
Monte-Carlo Estimation
In practice, due to limited time, we need to truncate the episodes, which introduces a (small) bias.
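As an illustration, here is a minimal Monte-Carlo sketch in Python. It assumes a simple environment interface with reset()/step() methods and a policy(state) function; these names and signatures are assumptions for the sketch, not part of the lecture.

```python
import numpy as np

def mc_value_estimate(env, policy, start_state, gamma=0.95,
                      n_episodes=1000, horizon=200):
    """Monte-Carlo estimate of V^pi(start_state).

    Runs n_episodes truncated episodes from start_state, computes the
    discounted return of each, and averages them (the empirical mean
    is the almost-unbiased estimator described above).
    """
    returns = []
    for _ in range(n_episodes):
        state = env.reset(start_state)      # assumed: env can restart in a chosen state
        g, discount = 0.0, 1.0
        for _ in range(horizon):            # truncation introduces the small bias
            action = policy(state)
            state, reward, done = env.step(action)  # assumed step() signature
            g += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(g)
    return np.mean(returns)
```

Averaging over more episodes reduces the (high) variance of the estimate; a longer horizon reduces the truncation bias.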
Monte-Carlo Estimation
Properties:
Maximum likelihood. We have seen many times that the empirical average is a maximum-likelihood estimator of the expected value.
(Almost) Unbiased. The empirical average is unbiased; the only bias comes from truncation.
High Variance. Monte-Carlo estimators suffer from high variance.
Works also for continuous MDPs.
Works also when the Markov assumption is not met.
Advanced question: How can I estimate $V^\pi$ when the data were collected with another policy? This is called off-policy evaluation (see Sutton 103).
Bellman Equations for Policy Evaluation
$V^\pi(s) = \sum_a \pi(a \mid s)\big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^\pi(s')\big]$,
or, in vector form for the MRP induced by $\pi$,
$V^\pi = R^\pi + \gamma P^\pi V^\pi.$
(the vector equation for $Q^\pi$ is very similar, but complicated by some mathematical notation details)
Closed-Form Solution
The vector Bellman equation is linear in $V^\pi$, so it can be solved in closed form:
$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi.$
Advanced Question (Extra Chocolate!): Verify that the value function is the sum of discounted rewards, starting from the closed-form solution and using the Neumann series. Send me an email before the next lecture.
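The following sketch checks the closed-form solution on a small, made-up MRP and compares it with a truncated Neumann series; the reward vector and transition matrix below are illustrative, not from the lecture.

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0, 2.0])               # illustrative expected reward per state
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])              # illustrative row-stochastic transitions

# Closed form: V = (I - gamma P)^{-1} R
V_closed = np.linalg.solve(np.eye(3) - gamma * P, R)

# Truncated Neumann series: sum_t (gamma P)^t R, i.e. the sum of discounted rewards
V_series = sum(np.linalg.matrix_power(gamma * P, t) @ R for t in range(500))

print(V_closed)
print(np.max(np.abs(V_closed - V_series)))   # ~0 up to truncation error
```

This numerical agreement is exactly the relation the advanced question asks you to prove analytically with the Neumann series.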
A Perspective on Operators
The whole idea of dynamic programming is based on an operator, called the Bellman operator, that takes as input a vector and returns another vector:
$T^\pi V = R^\pi + \gamma P^\pi V.$
When we repeatedly apply the Bellman operator to a vector, we converge to $V^\pi$, i.e.,
$\lim_{n \to \infty} (T^\pi)^n V = V^\pi.$
To understand how this math works, we first need to understand the Banach theorem.
Banach's Theorem
Banach's Theorem: Consider a vector space $\mathbb{R}^d$ equipped with a metric $\rho$. Consider an operator $T: \mathbb{R}^d \to \mathbb{R}^d$, where $d$ is the dimension of the vector space.
If the operator is contractive w.r.t. $\rho$, i.e.,
$\rho(T x, T y) \le L\, \rho(x, y) \quad \forall x, y,$
with $L < 1$,
then $T$ has a unique fixed point $x^*$, and
$\lim_{k \to \infty} T^k x = x^* \quad \forall x,$
where $T^k$ denotes the iterative application of $T$.
The Bellman operator (vector definition),
$T^\pi V = R^\pi + \gamma P^\pi V,$
is contractive under the $\infty$-norm, which means
$\|T^\pi V - T^\pi U\|_\infty \le \gamma\, \|V - U\|_\infty.$
Proof of Contractivity
$\|T^\pi V - T^\pi U\|_\infty = \gamma\, \|P^\pi (V - U)\|_\infty \le \gamma\, \|V - U\|_\infty,$
where the inequality holds because $P^\pi$ is a stochastic matrix (non-negative rows summing to one), so each entry of $P^\pi (V - U)$ is a convex combination of the entries of $V - U$.
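A quick numerical illustration of the contraction property on a randomly generated MRP; the matrices are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 4
R = rng.normal(size=n)
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # row-stochastic transition matrix

def T(V):
    """Bellman operator of the induced MRP: T V = R + gamma P V."""
    return R + gamma * P @ V

V, U = rng.normal(size=n), rng.normal(size=n)
lhs = np.max(np.abs(T(V) - T(U)))            # ||T V - T U||_inf
rhs = gamma * np.max(np.abs(V - U))          # gamma ||V - U||_inf
print(lhs <= rhs)                            # True: the operator is contractive
```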
Dynamic Programming
The (unique) fixed point of the Bellman operator satisfies the Bellman equation,
$T^\pi V = V \iff V = R^\pi + \gamma P^\pi V,$
thus the fixed point of the Bellman operator is the value function $V^\pi$.
Thanks to the Banach theorem, we can state that the iterated application of the Bellman operator to any vector $V_0$ converges to the value function $V^\pi$, i.e.,
$\lim_{n \to \infty} (T^\pi)^n V_0 = V^\pi.$
Tabular Dynamic Programming
1: Input: policy $\pi$, model $r(s, a)$ and $p(s' \mid s, a)$, discount $\gamma$, iterations $N$
2: Initialize: $V(s) \leftarrow 0$ for all $s$
3: for $n = 1, \dots, N$ do
4: for each $s \in \mathcal{S}$ do
5: $V'(s) \leftarrow \sum_a \pi(a \mid s)\big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V(s')\big]$
6: end for
7: $V \leftarrow V'$
8: end for
9: Return $V$

In vector form, for the MRP induced by $\pi$:
1: Input: $R^\pi$, $P^\pi$, discount $\gamma$, iterations $N$
2: Initialize: $V_0 \leftarrow \mathbf{0}$
3: for $n = 1, \dots, N$ do
4: $V_n \leftarrow R^\pi + \gamma P^\pi V_{n-1}$
5: end for
6: Return $V_N$
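A minimal Python version of the two-table sweep above, assuming the policy has already been folded into the induced MRP; the argument names R_pi and P_pi are mine, not the lecture's.

```python
import numpy as np

def tabular_policy_evaluation(R_pi, P_pi, gamma=0.9, n_iter=1000):
    """Tabular dynamic programming for policy evaluation.

    R_pi: shape (S,) expected reward per state under the fixed policy.
    P_pi: shape (S, S) transition matrix of the induced MRP.
    Uses two tables, V and V_new, as in the listing above.
    """
    n_states = len(R_pi)
    V = np.zeros(n_states)
    for _ in range(n_iter):
        V_new = np.empty(n_states)
        for s in range(n_states):                    # sweep over all states
            V_new[s] = R_pi[s] + gamma * P_pi[s] @ V
        V = V_new                                    # swap the tables
    return V
```

By the Banach argument above, the returned vector approaches the closed-form solution $(I - \gamma P^\pi)^{-1} R^\pi$ as the number of iterations grows.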
Bootstrapping
In dynamic programming, we needed two tables, $V$ and $V'$, and we updated $V'$ while reading from $V$.
The idea of bootstrapping is to keep only one table and to update each cell based on the current values of that same table. Bootstrapping is more efficient, since it continuously improves the value estimate.
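A sketch of the bootstrapped (one-table) variant, using the same assumed (R_pi, P_pi) representation as before.

```python
import numpy as np

def in_place_policy_evaluation(R_pi, P_pi, gamma=0.9, n_iter=1000):
    """Bootstrapped (in-place) policy evaluation with a single table V.

    Each cell is overwritten using the current entries of the same table,
    so later states in a sweep already see the freshly updated values.
    """
    V = np.zeros(len(R_pi))
    for _ in range(n_iter):
        for s in range(len(R_pi)):
            V[s] = R_pi[s] + gamma * P_pi[s] @ V     # reads partially updated V
    return V
```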
Main Idea: We can estimate the Bellman update by using online empirical averages.
Online Empirical Average. The empirical mean $\mu_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ can be computed online:
1: Input: stream of samples $x_1, x_2, \dots$
2: Initialize: $\mu_0 \leftarrow 0$
3: for $k = 1, 2, \dots$ do
4: receive sample $x_k$
5: $\mu_k \leftarrow \mu_{k-1} + \frac{1}{k}\,(x_k - \mu_{k-1})$
6: end for
Temporal Difference
Equivalently, with a learning rate $\alpha_k = 1/k$,
1: Input: stream of samples $x_1, x_2, \dots$
2: Initialize: $\mu_0 \leftarrow 0$
3: for $k = 1, 2, \dots$ do
4: receive sample $x_k$
5: $\alpha_k \leftarrow 1/k$
6: $\mu_k \leftarrow (1 - \alpha_k)\,\mu_{k-1} + \alpha_k\, x_k$
7: end for
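A tiny check that both incremental forms reproduce the batch empirical average; the sample stream below is synthetic.

```python
import numpy as np

x = np.random.default_rng(1).normal(size=1000)    # synthetic sample stream

mu = 0.0
for k, xk in enumerate(x, start=1):
    mu += (xk - mu) / k                           # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k

nu = 0.0
for k, xk in enumerate(x, start=1):
    alpha = 1.0 / k                               # learning-rate view, alpha_k = 1/k
    nu = (1 - alpha) * nu + alpha * xk

print(np.isclose(mu, x.mean()), np.isclose(nu, x.mean()))  # True True
```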
Temporal Difference
We can use the online averaging in place of the exact expectation of dynamic programming, i.e.,
$V(s_t) \leftarrow V(s_t) + \alpha_t\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$
and
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t\big(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big),$
where $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ and $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$.
Temporal Difference
Main Idea: unifying bootstrapping and online empirical averages, we obtain the temporal-difference algorithm.
1: Input: policy $\pi$, discount $\gamma$, learning rate $\alpha$
2: Initialize: $V(s) \leftarrow 0$ for all $s$
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
6: Sample action $a \sim \pi(\cdot \mid s)$
7: Observe reward $r$ and next state $s'$
8: $\delta \leftarrow r + \gamma V(s') - V(s)$
9: $V(s) \leftarrow V(s) + \alpha\, \delta$
10: $s \leftarrow s'$
11: end for
12: end for
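A minimal TD(0) sketch for estimating $V^\pi$ in Python, again assuming a simple reset()/step() environment interface and a policy(state) function (both assumptions for the sketch); the constant learning rate anticipates the discussion below.

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma=0.95, alpha=0.05,
                   n_episodes=5000, horizon=200):
    """TD(0) policy evaluation: bootstrapping + online averaging.

    Assumes discrete (hashable) states; V is stored in a dictionary.
    The env interface (reset()/step()) is an assumption, not from the lecture.
    """
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        for _ in range(horizon):
            action = policy(state)
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])     # TD update with error delta
            state = next_state
            if done:
                break
    return dict(V)
```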
Learning Rate
In the presented algorithms, the learning rate $\alpha$ is state(-action) dependent. Keeping $\alpha$ state(-action) dependent is an optimal choice in principle, but in practical implementations $\alpha$ is a global variable, independent of states and actions. One of the reasons is that in continuous state-action spaces, estimating the visitation count of each state-action pair is difficult.
Having a "global" learning rate that decreases over time is also inconvenient: states that are far from the initial state will be updated only with a low learning rate.
Most state-of-the-art algorithms use a global, constant learning rate fixed a priori.
Temporal Difference
In the limit of infinite visitation of each state-action pair, temporal difference is consistent (i.e., it converges to the true value function or $Q$-function with probability one).
Temporal difference has drastically lower variance than Monte-Carlo. However, when combined with function approximation it becomes biased.
Temporal difference is the foundation of all value-based and actor-critic methods. Recent research shows that even policy gradients can be understood in terms of temporal difference (Tosatto et al., 2021).