12 ML Reinforcement Learning Value Based Control

The document discusses various methods of Approximate Dynamic Programming (ADP) and policy improvement techniques in reinforcement learning, including Q-Learning and SARSA. It emphasizes the importance of function approximation for handling infinite state spaces and the convergence properties of these methods. Additionally, it covers the implementation of Deep Q-Networks (DQN) to stabilize learning through target functions and replay buffers.


Overview

1. Function Approximation
Approximate Dynamic Programming
Does ADP converge?

2. Policy Improvement
Policy Improvement Theorem
Policy Iteration
Value Iteration
Q-Learning
SARSA
Deep Networks

Recap: Dynamic Programming


Remember the Bellman update?

$$Q_{k+1}(s, a) \leftarrow r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q_k(s', a')$$

Here, $Q_k$ and $Q_{k+1}$ are tables with one entry per state-action pair. How can we represent value functions that have infinitely
many states?
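As a concrete reference point, here is a minimal sketch of one tabular Bellman sweep for policy evaluation, assuming a small MDP described by arrays `P` (transition probabilities), `R` (expected rewards), and a stochastic policy `pi`; all names and shapes are illustrative, not from the slides.

```python
import numpy as np

def bellman_update(Q, P, R, pi, gamma=0.99):
    """One sweep of the tabular Bellman update for policy evaluation.

    Q:  (S, A) table of action values
    P:  (S, A, S) transition probabilities p(s' | s, a)
    R:  (S, A) expected immediate rewards r(s, a)
    pi: (S, A) stochastic policy pi(a | s)
    """
    # Expected next-state value under the policy: sum_a' pi(a'|s') Q(s', a')
    V_next = (pi * Q).sum(axis=1)        # shape (S,)
    # Q_{k+1}(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) V_next(s')
    return R + gamma * P @ V_next        # shape (S, A)
```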

Dynamic Programming with Function Approximation


Two ideas:
values can be represented as parametric functions, i.e., $Q_\theta(s, a)$ (where $\theta$ is a parameter vector)
since we have infinitely many states, we need to operate with samples

Dynamic Programming with Function Approximation


Values can be parametrized, i.e., $Q_\theta(s, a)$, for example with:

Linear parametrization: $Q_\theta(s, a) = \theta^\top \phi(s, a)$, where $\phi(s, a)$ is a feature vector

Neural networks: $Q_\theta(s, a)$ is the output of a network with weights $\theta$
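A minimal sketch of the linear case, assuming a hand-crafted feature map `phi`; the feature choice and function names are illustrative only (the same assumed `phi` is reused in the later sketches).

```python
import numpy as np

def phi(s, a, n_actions=2):
    """Illustrative feature map: polynomial features of a scalar state,
    one block per action (so each action gets its own weights)."""
    state_feats = np.array([1.0, s, s**2])
    feats = np.zeros(len(state_feats) * n_actions)
    feats[a * len(state_feats):(a + 1) * len(state_feats)] = state_feats
    return feats

def q_linear(theta, s, a):
    # Q_theta(s, a) = theta^T phi(s, a)
    return theta @ phi(s, a)
```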

Dynamic Programming with Function Approximation
Assume you have a dataset of states $s_i$, actions $a_i$, rewards $r_i$, next states $s'_i$, and next actions $a'_i$, where $s'_i \sim p(\cdot \mid s_i, a_i)$ and $a'_i \sim \pi(\cdot \mid s'_i)$.

Approximate dynamic programming consists of iterating the following equations:

$$y_i = r_i + \gamma Q_{\theta_k}(s'_i, a'_i), \qquad \theta_{k+1} = \arg\min_\theta \sum_i \big( Q_\theta(s_i, a_i) - y_i \big)^2$$
Approximate Dynamic Programming for Policy Evaluation

1: Input: policy $\pi$, number of iterations $K$, and a parameter vector $\theta_0$
2: Collect samples $\{(s_i, a_i, r_i, s'_i, a'_i)\}_{i=1}^{N}$ using the policy $\pi$
3: for $k = 0, \dots, K-1$ do
4: Minimize $\sum_i \big( Q_\theta(s_i, a_i) - r_i - \gamma Q_{\theta_k}(s'_i, a'_i) \big)^2$ w.r.t. $\theta$, using e.g.,
gradient descent
5: Set $\theta_{k+1}$ to the resulting minimizer
6: end for
7: Return $\theta_K$.
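A rough sketch of this loop with a linear $Q$-function, fitting each iteration by least squares; the dataset arrays, the assumed `phi` feature map from above, and the hyperparameters are illustrative assumptions, not the slides' exact implementation.

```python
import numpy as np

def fitted_q_evaluation(S, A, R, S_next, A_next, phi, gamma=0.99, K=50):
    """Approximate DP for policy evaluation with Q_theta(s, a) = theta^T phi(s, a).

    S, A, R, S_next, A_next: arrays of transitions collected with the policy pi.
    """
    Phi = np.array([phi(s, a) for s, a in zip(S, A)])                 # features of (s_i, a_i)
    Phi_next = np.array([phi(s, a) for s, a in zip(S_next, A_next)])  # features of (s'_i, a'_i)
    theta = np.zeros(Phi.shape[1])
    for _ in range(K):
        y = R + gamma * Phi_next @ theta                  # frozen targets built from theta_k
        # least-squares minimizer of sum_i (theta^T phi_i - y_i)^2
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```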

Gradient Descent Approximate Dynamic Programming


Let us compute the gradient of the optimization problem introduced,

$$\nabla_\theta \sum_i \big( Q_\theta(s_i, a_i) - r_i - \gamma Q_{\theta_k}(s'_i, a'_i) \big)^2 = 2 \sum_i \big( Q_\theta(s_i, a_i) - r_i - \gamma Q_{\theta_k}(s'_i, a'_i) \big) \nabla_\theta Q_\theta(s_i, a_i).$$

This gradient allows us to write an online version of dynamic programming, and thus temporal
difference with function approximation.

Gradient Updates
Let $Q$ be parametric, i.e., $Q_\theta(s, a) = \theta^\top \phi(s, a)$, and update after each transition $(s, a, r, s', a')$:

$$\theta \leftarrow \theta + \alpha_t \big( r + \gamma Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a),$$

where $\theta \in \mathbb{R}^d$, $\phi(s, a) \in \mathbb{R}^d$, and $\alpha_t$ is the learning rate.
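Written out as a single semi-gradient update step with linear features (illustrative helper names, same assumed `phi` as above):

```python
def td_update(theta, s, a, r, s_next, a_next, phi, alpha=0.1, gamma=0.99):
    """One semi-gradient TD update for Q_theta(s, a) = theta^T phi(s, a)."""
    td_error = r + gamma * theta @ phi(s_next, a_next) - theta @ phi(s, a)
    # For linear parametrizations, grad_theta Q_theta(s, a) = phi(s, a)
    return theta + alpha * td_error * phi(s, a)
```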

Temporal Difference with Function Approximation

Temporal Difference with Function Approximation for Value Functions

1: Input: policy $\pi$, number of episodes $E$, parameter vector $\theta$
2: for $e = 1, \dots, E$ do
3: Sample first state $s$
4: for each step $t$ of the episode do
5: Sample action $a \sim \pi(\cdot \mid s)$
6: Update the learning rate $\alpha_t$
7: Apply $a$ on the environment and receive reward $r$ and next state $s'$
8: $\theta \leftarrow \theta + \alpha_t \big( r + \gamma V_\theta(s') - V_\theta(s) \big) \nabla_\theta V_\theta(s)$
9: $s \leftarrow s'$
10: end for
11: end for

Temporal Difference with Function Approximation

Temporal Difference with Function Approximation for Action-Value Functions

1: Input: policy $\pi$, number of episodes $E$, parameter vector $\theta$
2: for $e = 1, \dots, E$ do
3: Sample first state $s$
4: Sample action $a \sim \pi(\cdot \mid s)$
5: for each step $t$ of the episode do
6: Update the learning rate $\alpha_t$
7: Apply $a$ on the environment and receive reward $r$ and next state $s'$
8: Sample $a' \sim \pi(\cdot \mid s')$
9: $\theta \leftarrow \theta + \alpha_t \big( r + \gamma Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
10: $s \leftarrow s'$, $a \leftarrow a'$
11: end for
12: end for

Convergence
Approximate dynamic programming and temporal difference with function
approximation converge to a biased solution when the parametrization is linear.
When the function approximation is not linear (e.g., neural networks), ADP and TD with
function approximation are not guaranteed to converge.

In either case, their estimate is biased. Such bias can be mitigated by introducing many
parameters, but many parameters cause high estimation variance, which can only be
compensated by using many samples.
State-of-the-art reinforcement learning tends to use deep neural networks with millions (or
even billions) of parameters, and uses a vast amount of samples.

Our Objective
Our objective is to find a policy that performs at least as well as any other policy. In
mathematical terms:

$$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \text{for every state } s \text{ and every policy } \pi.$$

We call such a policy optimal. Note: there might be several optimal policies.

Policy Improvement Theorem

Policy Improvement Theorem


If $Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$ for all states $s$ (with $\pi$, $\pi'$ deterministic), then $V^{\pi'}(s) \ge V^{\pi}(s)$ for all states $s$.
Note: the policy improvement theorem is more general in textbooks; however, this
simplified (and more specific) version still allows us to derive the following corollaries and
algorithms.

Greedy Policies: A policy $\pi$ is said to be greedy (w.r.t. $Q$) if $\pi(s) \in \arg\max_a Q(s, a)$ for all states $s$.
Tabular Policy Iteration

Tabular Policy Iteration


1: Create a table $\pi$ that represents a deterministic policy, and initialize it randomly
2: for $k = 1, 2, \dots$ do
3: $Q^{\pi} \leftarrow$ PolicyEvaluation($\pi$) (e.g., use DP for policy evaluation as introduced in the previous lecture)
4: for each state $s$ do
5: $\pi(s) \leftarrow \arg\max_a Q^{\pi}(s, a)$
6: end for
7: end for
Question: does this algorithm converge to the optimal policy? Yes (for tabular $Q$ and $\pi$).

However, it is computationally expensive, because after each policy improvement we need to
re-evaluate the policy.
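A compact sketch of tabular policy iteration, reusing the assumed `P` and `R` arrays from the earlier Bellman-update example; solving the evaluation step exactly with a linear solve is one of several possible choices.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, iters=100):
    """Tabular policy iteration for a finite MDP. P: (S, A, S), R: (S, A)."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                 # deterministic policy
    for _ in range(iters):
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(n_states), pi]              # (S, S)
        R_pi = R[np.arange(n_states), pi]              # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy w.r.t. Q^pi
        Q = R + gamma * P @ V                          # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):                 # policy is stable -> optimal
            break
        pi = new_pi
    return pi, Q
```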

Policy Improvement Theorem
Corollary 1 If there exists a policy $\pi^*$ that is greedy w.r.t. its own action-value function $Q^{\pi^*}$, then
$\pi^*$ is optimal.
Corollary 2 (Optimality Bellman Equation) The optimal $Q$-function satisfies the following
optimality Bellman equation:

$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q^*(s', a')$$
Can we use the optimality Bellman equation to obtain some guarantees on the
convergence of policy iteration, and to derive a more efficient algorithm?

Optimality Bellman Operator


Consider the optimality Bellman operator

$$[T^* Q](s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a').$$

The optimality Bellman operator is contractive and, thanks to Banach's fixed-point theorem, we
can state that

$$\lim_{k \to \infty} (T^*)^k Q = Q^* \quad \text{for any initial } Q.$$

Once we obtain the optimal action-value function $Q^*$, obtaining an optimal policy is trivial:

$$\pi^*(s) = \arg\max_a Q^*(s, a).$$

(Note: such a policy is deterministic.)

On the Contractivity of the Optimality Bellman Operator


Consider the optimality Bellman operator $T^*$ defined above. For any two action-value functions $Q_1$ and $Q_2$,

$$\| T^* Q_1 - T^* Q_2 \|_\infty \le \gamma \| Q_1 - Q_2 \|_\infty,$$

which follows from $|\max_{a'} Q_1(s', a') - \max_{a'} Q_2(s', a')| \le \max_{a'} |Q_1(s', a') - Q_2(s', a')|$.
Value Iteration

Tabular Value Iteration

1: Create two tables $Q$ and $Q'$ that represent the $Q$-functions, and a table $\pi$ that represents the
policy
2: for $k = 1, 2, \dots$ do
3: for each state-action pair $(s, a)$ do
4: $Q'(s, a) \leftarrow r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a')$
5: end for
6: for each state $s$ do
7: $\pi(s) \leftarrow \arg\max_a Q'(s, a)$
8: end for
9: $Q \leftarrow Q'$
10: end for
11: Return $\pi$ and $Q$
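A minimal sketch of tabular value iteration with the same assumed `P` and `R` arrays; the stopping threshold is an illustrative choice.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration. P: (S, A, S), R: (S, A)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Optimality Bellman operator: Q'(s,a) = r(s,a) + gamma * E[max_a' Q(s',a')]
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    pi = Q_new.argmax(axis=1)      # greedy (optimal) policy
    return pi, Q_new
```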

Value Iteration
Value iteration is basically dynamic programming with the max operator selecting the
action at the next state.
It is model-based (it needs knowledge of the transition model and the reward). Similarly to
what we have seen in the previous lecture, we can derive a model-free, online version, called
$Q$-learning.

Online Algorithms
Like in the previous lecture, we want to devise an online algorithm that uses samples to
update the $Q$-function and the policy.

Online Algorithm
1: Initialize a $Q$-function
2: for Episodes do
3: Sample first state $s$
4: for Single episode do
5: Sample action $a$ according to a policy (which policy?)
6: Apply $a$ on the environment and receive reward $r$ and next state $s'$
7: Use $(s, a, r, s')$ to update the $Q$-function
8: $s \leftarrow s'$
9: end for
10: end for

Q-learning
Like in the previous lecture, we can use online averaging and bootstrapping, i.e.,

$$Q(s, a) \leftarrow Q(s, a) + \alpha_t \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big),$$

where $r$ is the reward and $s'$ is the next state.

$\epsilon$-Greedy Policies
We want to obtain an "online" algorithm: i.e., an algorithm that improves the policy while
interacting with the environment.
A popular strategy is to use $\epsilon$-greedy policies: policies that select the greedy action with
probability $1 - \epsilon$, and a random (possibly sub-optimal) action with probability $\epsilon$.

Such policies select good actions with high probability (usually $\epsilon$ is small), while
still exploring and avoiding local minima. (For the most curious: see the exploration-
exploitation trade-off.)
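A minimal sketch of $\epsilon$-greedy action selection over a tabular $Q$; the array layout and the tie-unaware argmax are illustrative choices.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    """Pick the greedy action w.r.t. Q[s] with prob. 1 - epsilon, a random one otherwise."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[s]))               # exploit
```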

Q-learning

Tabular Q-learning
1: Input: $\epsilon$, number of episodes $E$
2: Initialize: a table of state-action visitation counts $N(s, a)$, and a table of $Q$-values $Q(s, a)$
initialized with zeros
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
6: With probability $1 - \epsilon$ select $a = \arg\max_{a'} Q(s, a')$, otherwise select $a$ randomly # Greedy Policy
7: $N(s, a) \leftarrow N(s, a) + 1$, $\;\alpha \leftarrow 1 / N(s, a)$ # Learning Rate Update
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)$ # Bellman Update
10: $s \leftarrow s'$
11: end for
12: end for
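A compact sketch of this loop, assuming a minimal environment object whose `reset()` returns the initial state and whose `step(a)` returns `(next_state, reward, done)`; this interface, the $1/N(s,a)$ learning rate, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, gamma=0.9, epsilon=0.1, max_steps=5):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))          # state-action visitation counts
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            # epsilon-greedy action selection
            a = int(np.argmax(Q[s])) if rng.random() >= epsilon else int(rng.integers(n_actions))
            N[s, a] += 1
            alpha = 1.0 / N[s, a]                # learning-rate update
            s_next, r, done = env.step(a)
            # Bellman update with the max over next actions (off-policy target)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
            if done:
                break
    return Q
```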

Simulation Time
Let's simulate $Q$-learning. We use the "investor MDP" as an example (next slide). There are
three states: {0: rich, 1: well-off, 2: poor}.
There are two actions: {0: no-invest, 1: invest}.

We assume a constant learning rate $\alpha$, discount factor $\gamma$, and an $\epsilon$-greedy policy. Episodes are truncated after 5 steps.
I need 5 students:
One student executes the MDP (simulate.py)
One student executes the epsilon-greedy policy (epsilon_greedy.py)
One student computes $\max_{a'} Q(s', a')$
One student computes the entire update

One student keeps track of the Q-table on the blackboard.

Tabular SARSA
1: Input: $\epsilon$, number of episodes $E$
2: Initialize: a table of state-action visitation counts $N(s, a)$, and a table of $Q$-values $Q(s, a)$
initialized with zeros.
3: for Episodes do
4: Sample first state $s$.
5: With probability $1 - \epsilon$ select $a = \arg\max_{a'} Q(s, a')$, otherwise select $a$ randomly.
6: for Single episode do
7: $N(s, a) \leftarrow N(s, a) + 1$, $\;\alpha \leftarrow 1 / N(s, a)$ # Learning Rate Update
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: With probability $1 - \epsilon$ select $a' = \arg\max_{a''} Q(s', a'')$, otherwise select a random action $a'$. # Greedy Policy
10: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma Q(s', a') - Q(s, a) \big)$ # Bellman Update
11: $s \leftarrow s'$, $a \leftarrow a'$
12: end for
13: end for
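The only change from the Q-learning sketch above is the target: SARSA bootstraps on the next action actually selected by the behavior policy rather than on the max. A minimal, illustrative update helper (assuming `Q` is a 2-D numpy array):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA bootstraps on the next action actually taken (on-policy),
    instead of the max over next actions used by Q-learning."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    return Q
```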

Q-learning

Q-learning with Function Approximation

1: Input: $\epsilon$, number of episodes $E$
2: Initialize: the parameter vector $\theta$ of the $Q$-function $Q_\theta$ (e.g., with zeros)
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
6: With probability $1 - \epsilon$ select $a = \arg\max_{a'} Q_\theta(s, a')$, otherwise select a random action $a$.
7: Update the learning rate $\alpha_t$.
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: $\theta \leftarrow \theta + \alpha_t \big( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$; $\; s \leftarrow s'$
10: end for
11: end for
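A sketch of the semi-gradient Q-learning step for a linear $Q_\theta$, reusing the assumed `phi` feature map from earlier; names and defaults are illustrative.

```python
import numpy as np

def q_learning_fa_update(theta, s, a, r, s_next, actions, phi, alpha=0.1, gamma=0.99):
    """Semi-gradient Q-learning step for Q_theta(s, a) = theta^T phi(s, a)."""
    q_next_max = max(theta @ phi(s_next, a2) for a2 in actions)   # max over next actions
    td_error = r + gamma * q_next_max - theta @ phi(s, a)
    return theta + alpha * td_error * phi(s, a)                   # grad of linear Q is phi(s, a)
```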

SARSA with Function Approximation


1: Input: $\epsilon$, number of episodes $E$
2: Initialize: the parameter vector $\theta$ of the $Q$-function $Q_\theta$ (e.g., with zeros).
3: for Episodes do
4: Sample first state $s$.
5: With probability $1 - \epsilon$ select $a = \arg\max_{a'} Q_\theta(s, a')$, otherwise select $a$ randomly.
6: for Single episode do
7: Update the learning rate $\alpha_t$
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: With probability $1 - \epsilon$ select $a' = \arg\max_{a''} Q_\theta(s', a'')$, otherwise select $a'$ randomly.
10: $\theta \leftarrow \theta + \alpha_t \big( r + \gamma Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$; $\; s \leftarrow s'$, $a \leftarrow a'$
11: end for
12: end for

SARSA and Q-Learning


SARSA can be seen as an online version of policy iteration (the Bellman update only evaluates
the current policy), and the current policy is ($\epsilon$-)greedy w.r.t. the $Q$-function.
Q-learning can be seen as an online version of value iteration (the Bellman update
evaluates the greedy policy).

For this reason, SARSA is an on-policy algorithm (i.e., it evaluates the current policy), while
$Q$-learning is off-policy, since it evaluates a greedy policy while using an $\epsilon$-greedy policy on
the environment.

Deep Q-Network
Q-learning with function approximation tends to be a bit unstable.
DQN aims to mitigate those instabilities by 1) introducing a target $Q$-function, and 2)
introducing randomized mini-batch updates (replay buffer).
Target $Q$-functions. To stabilize learning, it is useful to avoid bootstrapping on the parameters currently being optimized. The idea of DQN
is to keep two separate functions, as in DP. This can be done by having two separate sets
of parameters, $\theta$ and $\bar\theta$: $Q_\theta$ is updated at every step, while $Q_{\bar\theta}$ is used to compute the bootstrapped targets $r + \gamma \max_{a'} Q_{\bar\theta}(s', a')$.
The target parameters are updated once in a while ($\bar\theta \leftarrow \theta$).

Replay buffer. In classic $Q$-learning, samples are very correlated, as they are obtained by
running the MDP. Using non-i.i.d. samples with function approximation is problematic. The
idea of the replay buffer is to store the last $N$ transitions, and to sample each time a minibatch
of (approximately) i.i.d. samples from it.
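A minimal sketch of these two ingredients, with a linear $Q_\theta$ standing in for the deep network to keep the example dependency-free; the buffer capacity, batch size, assumed `phi`, and update rule are illustrative assumptions rather than the DQN paper's exact recipe.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)      # keeps only the last `capacity` transitions

    def append(self, transition):                 # transition = (s, a, r, s_next)
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def dqn_style_update(theta, theta_target, batch, phi, actions, alpha=0.05, gamma=0.99):
    """Mini-batch semi-gradient update using a frozen target parameter vector."""
    grad = np.zeros_like(theta)
    for s, a, r, s_next in batch:
        # Targets use theta_target, not the parameters being optimized
        y = r + gamma * max(theta_target @ phi(s_next, a2) for a2 in actions)
        grad += (y - theta @ phi(s, a)) * phi(s, a)
    return theta + alpha * grad / len(batch)

# Every C episodes (or steps): theta_target = theta.copy()   # target update
```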

Deep Q-Network

DQN
1: for Episodes do
2: Sample first state $s$
3: for Single episode do
4: With probability $1 - \epsilon$ select $a = \arg\max_{a'} Q_\theta(s, a')$, otherwise select $a$ randomly.
5: Update the learning rate $\alpha_t$.
6: Apply $a$ on the environment and receive reward $r$ and next state $s'$
7: Append $(s, a, r, s')$ to the replay buffer $\mathcal{D}$.
8: Sample a minibatch $\{(s_i, a_i, r_i, s'_i)\}$ from $\mathcal{D}$
9: $\theta \leftarrow \theta + \alpha_t \sum_i \big( r_i + \gamma \max_{a'} Q_{\bar\theta}(s'_i, a') - Q_\theta(s_i, a_i) \big) \nabla_\theta Q_\theta(s_i, a_i)$; $\; s \leftarrow s'$
10: end for
11: Every $C$ episodes, $\bar\theta \leftarrow \theta$ # Target update
12: end for
