12 ML Reinforcement Learning Value Based Control
1. Function Approximation
Approximate Dynamic Programming
Does ADP converge?
2. Policy Improvement
Policy Improvement Theorem
Policy Iteration
Value Iteration
Q-Learning
SARSA
Deep Networks
Here, $V$ and $Q$ are tables. How can we represent value functions over infinitely
many states? Two options (sketched below):
Linear parametrization
Neural networks
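A minimal sketch of the first option, with a hypothetical one-hot feature map standing in for a problem-specific one (the state and action counts are placeholders):

import numpy as np

N_STATES, N_ACTIONS = 3, 2          # hypothetical sizes (cf. the investor MDP later on)

def phi(s, a):
    # One-hot feature map phi(s, a) in R^{N_STATES * N_ACTIONS}.
    # With one-hot features a linear Q recovers the tabular case; for continuous
    # states one would use, e.g., polynomials, tile coding, or RBFs instead.
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

theta = np.zeros(N_STATES * N_ACTIONS)    # parameter vector

def q_linear(s, a, theta):
    # Linear parametrization: Q_theta(s, a) = phi(s, a)^T theta
    return phi(s, a) @ theta

# A neural network would replace phi(s, a)^T theta with a nonlinear function of
# (s, a) parametrized by theta; the interface Q_theta(s, a) stays the same.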
Dynamic Programming with Function Approximation
Assume you have a dataset of states $s_i$, actions $a_i$, rewards $r_i$, next states $s'_i$ and next
actions $a'_i$, where $s'_i \sim p(\cdot \mid s_i, a_i)$ and $a'_i \sim \pi(\cdot \mid s'_i)$.
3: for $k = 1, \dots, K$ do
6: end for
7: Return $Q_K$.
3: for $k = 1, \dots, K$ do
4: Minimize $L(\theta) = \sum_i \big( Q_\theta(s_i, a_i) - r_i - \gamma\, Q_{k-1}(s'_i, a'_i) \big)^2$ using,
e.g., gradient descent
5: $Q_k \leftarrow Q_\theta$
6: end for
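A minimal sketch of the fitted (batch) step under the assumptions above, with a linear $Q_\theta$ and a plain gradient-descent inner loop; the dataset format and all hyperparameters are placeholders:

import numpy as np

def approximate_dp(dataset, phi, d, gamma=0.99, n_iters=50, lr=0.1, n_grad_steps=200):
    # dataset: list of (s, a, r, s_next, a_next) tuples, as described above.
    theta_prev = np.zeros(d)                      # parameters of Q_{k-1}
    for _ in range(n_iters):
        theta = theta_prev.copy()
        for _ in range(n_grad_steps):
            # Gradient of the regression loss
            #   L(theta) = sum_i ( Q_theta(s_i, a_i) - r_i - gamma * Q_{k-1}(s'_i, a'_i) )^2
            grad = np.zeros(d)
            for (s, a, r, s_next, a_next) in dataset:
                target = r + gamma * (phi(s_next, a_next) @ theta_prev)   # fixed target
                grad += 2.0 * (phi(s, a) @ theta - target) * phi(s, a)
            theta -= lr * grad / len(dataset)
        theta_prev = theta                        # Q_k becomes the target for the next iteration
    return theta_prev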
This gradient allows us to write an online version of dynamic programming, i.e., temporal
difference learning with function approximation.
Gradient Updates
Let $Q$ be parametric, i.e., $Q_\theta(s, a)$ with $\theta \in \mathbb{R}^d$,
where, e.g., $Q_\theta(s, a) = \phi(s, a)^\top \theta$ for a linear parametrization, and $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is a feature map.
4: for Single episode do
5: Sample action $a \sim \pi(\cdot \mid s)$
9: $\theta \leftarrow \theta + \alpha \big( r + \gamma\, Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
10: end for
11: end for
Temporal Difference with Function Approximation
1: Input: learning rate $\alpha$, number of episodes $N$, parameter vector $\theta$
2: for Episodes do
9: $\delta \leftarrow r + \gamma\, Q_\theta(s', a') - Q_\theta(s, a)$
10: $\theta \leftarrow \theta + \alpha\, \delta\, \nabla_\theta Q_\theta(s, a)$
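Putting the pieces together, a compact sketch of the online loop with a linear $Q_\theta$; the env.reset()/env.step() interface and the sampling policy are assumptions for illustration:

import numpy as np

def td_function_approx(env, policy, phi, d, n_episodes=100, alpha=0.05, gamma=0.99):
    # Online (semi-)gradient TD for Q with a parametric Q_theta (here linear).
    theta = np.zeros(d)
    for _ in range(n_episodes):
        s = env.reset()                              # assumed env interface
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)
            # TD error; the bootstrapped target is treated as a constant
            q_next = 0.0 if done else phi(s_next, a_next) @ theta
            delta = r + gamma * q_next - phi(s, a) @ theta
            theta += alpha * delta * phi(s, a)       # gradient of a linear Q_theta is phi(s, a)
            s, a = s_next, a_next
    return theta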
Convergence
Approximate dynamic programming and temporal difference with function
approximation converge to a biased solution when the parametrization is linear.
When the function approximation is not linear (e.g., neural networks), ADP and TD with
function approximation are not guaranteed to converge.
In either case, their estimate is biased. Such bias can be mitigated by introducing many
parameters, but many parameters cause high estimation variance, which can only be
compensated by using many samples.
State-of-the-art reinforcement learning tends to use deep neural networks with millions (or
even billions) of parameters, and a vast amount of samples.
Our Objective
Our objective is to find a policy that performs at least as well as any other policy. In
mathematical terms:
$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \text{for all } s \text{ and all policies } \pi.$
We call such a policy optimal. Note: there might be several optimal policies.
6: end for
7: end for
Question: does this algorithm converge to the optimal policy? Yes (for tabular $V$ and $Q$).
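A minimal tabular policy iteration sketch for a finite MDP, assuming a known transition tensor P[s, a, s'] and reward table R[s, a] (both hypothetical inputs):

import numpy as np

def policy_iteration(P, R, gamma=0.99, n_eval_sweeps=200):
    # P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for Q^pi
        Q = np.zeros((n_states, n_actions))
        for _ in range(n_eval_sweeps):
            V = Q[np.arange(n_states), pi]        # V^pi(s) = Q^pi(s, pi(s))
            Q = R + gamma * P @ V
        # Policy improvement: act greedily with respect to Q^pi
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):            # no change -> converged (tabular case)
            return pi, Q
        pi = pi_new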
Policy Improvement Theorem
Corollary 1. If there exists a policy $\pi'$ such that $Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$ for all $s$, then
$V^{\pi'}(s) \ge V^{\pi}(s)$ for all $s$.
Corollary 2 (Optimality Bellman Equation). The optimal $Q$-function satisfies the following
optimality Bellman equation:
$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q^*(s', a').$
Can we use the optimality Bellman equation to obtain some guarantees on the
convergence of policy iteration, and to derive a more efficient algorithm?
The optimality Bellman operator is contractive and, thanks to Banach's fixed-point theorem, we
can state that iterating it converges to a unique fixed point, the optimal $Q$-function $Q^*$.
Once we obtain the optimal action-value function $Q^*$, obtaining an optimal policy is trivial:
$\pi^*(s) \in \arg\max_a Q^*(s, a)$.
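Written out (the operator notation is introduced here for convenience, not taken from the slides), the optimality Bellman operator and its contraction property read:
\[
(\mathcal{T}^* Q)(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a'),
\qquad
\| \mathcal{T}^* Q_1 - \mathcal{T}^* Q_2 \|_\infty \le \gamma \, \| Q_1 - Q_2 \|_\infty,
\]
so, by Banach's fixed-point theorem, the iteration $Q_{k+1} = \mathcal{T}^* Q_k$ converges to the unique fixed point $Q^*$ from any initialization.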
Value Iteration
3: for $k = 1, \dots, K$ do
4: $Q_k(s, a) \leftarrow r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q_{k-1}(s', a')$ for all $(s, a)$
5: end for
6: for all states $s$ do
7: $\pi(s) \leftarrow \arg\max_a Q_K(s, a)$
8: end for
9: Return $\pi$ and $Q_K$
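The same algorithm as a short sketch, again assuming a known transition tensor P[s, a, s'] and reward table R[s, a]:

import numpy as np

def value_iteration(P, R, gamma=0.99, n_iters=500):
    # P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Optimality Bellman backup: bootstrap with a max over next actions
        Q = R + gamma * P @ Q.max(axis=1)
    pi = Q.argmax(axis=1)          # greedy policy w.r.t. the (near-)optimal Q
    return pi, Q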
Value Iteration
Value iteration is basically dynamic programming with a max operator that selects the
action at the next state.
Value iteration is model-based (it needs knowledge of the transition model and the reward). Similarly to
what we have seen in the previous lecture, we can derive a model-free, online version, called
$Q$-learning.
Online Algorithms
Like in the previous lecture, we want to devise an online algorithm that uses samples to
update the $Q$-function and the policy.
Online Algorithm
1: Initialize a $Q$-function
2: for Episodes do
6: Apply $a$ on the environment and receive reward $r$ and next state $s'$
7: Use $(s, a, r, s')$ to update the $Q$-function
8:
9: end for
10: end for
Q-learning
Like in the previous lecture, we can use online averaging and bootstrapping, i.e.,
$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)$, with $\alpha = 1 / n(s, a)$.
$\varepsilon$-Greedy Policies
We want to obtain an "online" algorithm, i.e., an algorithm that improves the policy while
interacting with the environment.
A popular strategy is to use $\varepsilon$-greedy policies: policies that select the greedy action with
probability $1 - \varepsilon$, and select a random (possibly sub-optimal) action with probability $\varepsilon$.
Such policies select good actions with high probability (usually $\varepsilon$ is small), while
still exploring and avoiding local minima. (For the most curious: see the exploration-
exploitation trade-off.)
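As a one-function sketch (the Q-table layout matches the tabular algorithms below):

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    # With probability 1 - epsilon select the greedy action, otherwise explore uniformly.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # random action
    return int(Q[s].argmax())                  # greedy action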
Q-learning
Tabular Q-learning
1: Input: exploration rate $\varepsilon$, number of episodes $N$
2: Initialize: a table of state-action visitation counts $n(s, a)$, a table of $Q$-values $Q(s, a)$
initialized with zeros
3: for Episodes do
With probability $1 - \varepsilon$ select $a = \arg\max_{a'} Q(s, a')$, otherwise select $a$ randomly.
$n(s, a) \leftarrow n(s, a) + 1$, $\alpha \leftarrow 1 / n(s, a)$ # Learning Rate Update
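A minimal tabular Q-learning sketch that follows the steps above, reusing the epsilon_greedy helper and a 1/n(s, a) learning rate; the env.reset()/env.step() interface is an assumption:

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=1000, gamma=0.99, epsilon=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))      # Q-values initialized with zeros
    n = np.zeros((n_states, n_actions))      # state-action visitation counts
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)      # epsilon-greedy behaviour policy
            s_next, r, done = env.step(a)               # assumed env interface
            n[s, a] += 1
            alpha = 1.0 / n[s, a]                       # learning rate update (online averaging)
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])       # Bellman (Q-learning) update
            s = s_next
    return Q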
Simulation Time
Let's simulate $Q$-learning. We use the "investor MDP" as an example (next slide). There are
three states: rich, well-off, poor.
There are two actions: 0: no-invest, 1: invest.
Tabular SARSA
1: Input: exploration rate $\varepsilon$, number of episodes $N$
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: With probability $1 - \varepsilon$ select $a' = \arg\max_{a''} Q(s', a'')$, otherwise select a random action
$a'$. # $\varepsilon$-Greedy Policy
10: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma\, Q(s', a') - Q(s, a) \big)$ # Bellman Update
11: $s \leftarrow s'$, $a \leftarrow a'$
12: end for
Q-learning
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
7: Update the learning rate
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
Note that the $Q$-learning update bootstraps with $\max_{a'} Q(s', a')$, regardless of which action is actually
executed next. For this reason, SARSA is an on-policy algorithm (i.e., it evaluates the current policy), while
$Q$-learning is off-policy, since it evaluates the greedy policy while using an $\varepsilon$-greedy policy on
the environment.
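Side by side, the two updates differ only in the bootstrap term:
\[
\text{SARSA:}\;\; Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma\, Q(s', a') - Q(s, a)\big),
\qquad
\text{Q-learning:}\;\; Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big),
\]
where $a'$ is the action actually selected by the $\varepsilon$-greedy behaviour policy.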
Deep Q-Network
Q-learning with function approximation tends to be a bit unstable.
DQN aims to mitigate those instabilities by 1) introducing a target $Q$-function, and 2)
introducing randomized mini-batch updates (replay buffer).
Target $Q$-functions. To stabilize learning, it is useful to avoid bootstrapping on a constantly
changing $Q$-function. The idea of DQN is to keep two separate functions, as in DP. This can be
done by having two separate sets of parameters $\theta$ and $\theta^-$: $Q_\theta$ is updated at every step,
while $Q_{\theta^-}$ provides the bootstrap targets. The target parameters are updated once in a while,
e.g., $\theta^- \leftarrow \theta$ every $C$ steps.
Replay buffer. In classic $Q$-learning, samples are very correlated, as they are obtained by
running the MDP. Using non-i.i.d. samples with function approximation is problematic. The
idea of the replay buffer is to store the last $N$ samples, and to sample each time a mini-batch
of (approximately) i.i.d. samples from it.
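A compact sketch of the two ingredients, using a linear $Q_\theta$ for brevity (a deep network would take its place); the environment interface and all hyperparameters are placeholders:

import random
from collections import deque
import numpy as np

def dqn_style_training(env, phi, d, n_actions, n_steps=10_000, gamma=0.99, lr=1e-3,
                       epsilon=0.1, batch_size=32, buffer_size=10_000, sync_every=500):
    theta = np.zeros(d)                    # online parameters
    theta_target = theta.copy()            # target parameters, kept fixed between syncs
    buffer = deque(maxlen=buffer_size)     # replay buffer of (s, a, r, s', done) samples

    s = env.reset()                        # assumed env interface
    for t in range(n_steps):
        # epsilon-greedy action from the online Q
        q_s = np.array([phi(s, b) @ theta for b in range(n_actions)])
        a = random.randrange(n_actions) if random.random() < epsilon else int(q_s.argmax())
        s_next, r, done = env.step(a)
        buffer.append((s, a, r, s_next, done))
        s = env.reset() if done else s_next

        if len(buffer) >= batch_size:
            # randomized mini-batch: breaks the correlation of consecutive samples
            batch = random.sample(list(buffer), batch_size)
            grad = np.zeros(d)
            for (bs, ba, br, bs_next, bdone) in batch:
                q_next = max(phi(bs_next, bn) @ theta_target for bn in range(n_actions))
                target = br + gamma * q_next * (not bdone)   # bootstrap on the target parameters
                grad += (phi(bs, ba) @ theta - target) * phi(bs, ba)
            theta -= lr * grad / batch_size

        if t % sync_every == 0:
            theta_target = theta.copy()    # periodic target update
    return theta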
Deep Q-Network
DQN
1: for Episodes do
2: Sample first state $s$
9: