12 ML Reinforcement Learning Value Based Control
1. Function Approximation
Approximate Dynamic Programming
Does ADP converge?
2. Policy Improvement
Policy Improvement Theorem
Policy Iteration
Value Iteration
Q-Learning
SARSA
Deep Networks
Here, $V$ and $Q$ are tables. How can we represent value functions over infinitely
many states? Two options (sketched below):
Linear parametrization
Neural networks
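A minimal sketch of the first option, with a hypothetical one-hot feature map standing in for a problem-specific one (the state and action counts are placeholders):

import numpy as np

N_STATES, N_ACTIONS = 3, 2          # hypothetical sizes (cf. the investor MDP later on)

def phi(s, a):
    # One-hot feature map phi(s, a) in R^{N_STATES * N_ACTIONS}.
    # With one-hot features a linear Q recovers the tabular case; for continuous
    # states one would use, e.g., polynomials, tile coding, or RBFs instead.
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

theta = np.zeros(N_STATES * N_ACTIONS)    # parameter vector

def q_linear(s, a, theta):
    # Linear parametrization: Q_theta(s, a) = phi(s, a)^T theta
    return phi(s, a) @ theta

# A neural network would replace phi(s, a)^T theta with a nonlinear function of
# (s, a) parametrized by theta; the interface Q_theta(s, a) stays the same.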
Dynamic Programming with Function Approximation
Assume you have a dataset of states $s_i$, actions $a_i$, rewards $r_i$, next states $s'_i$ and next
actions $a'_i$, where $s'_i \sim p(\cdot \mid s_i, a_i)$ and $a'_i \sim \pi(\cdot \mid s'_i)$.
3: for $k = 1, \dots, K$ do
6: end for
7: Return $Q_K$.
3: for $k = 1, \dots, K$ do
4: Minimize $L(\theta) = \sum_i \big( Q_\theta(s_i, a_i) - r_i - \gamma\, Q_{k-1}(s'_i, a'_i) \big)^2$ using,
e.g., gradient descent
5: $Q_k \leftarrow Q_\theta$
6: end for
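A minimal sketch of the fitted (batch) step under the assumptions above, with a linear $Q_\theta$ and a plain gradient-descent inner loop; the dataset format and all hyperparameters are placeholders:

import numpy as np

def approximate_dp(dataset, phi, d, gamma=0.99, n_iters=50, lr=0.1, n_grad_steps=200):
    # dataset: list of (s, a, r, s_next, a_next) tuples, as described above.
    theta_prev = np.zeros(d)                      # parameters of Q_{k-1}
    for _ in range(n_iters):
        theta = theta_prev.copy()
        for _ in range(n_grad_steps):
            # Gradient of the regression loss
            #   L(theta) = sum_i ( Q_theta(s_i, a_i) - r_i - gamma * Q_{k-1}(s'_i, a'_i) )^2
            grad = np.zeros(d)
            for (s, a, r, s_next, a_next) in dataset:
                target = r + gamma * (phi(s_next, a_next) @ theta_prev)   # fixed target
                grad += 2.0 * (phi(s, a) @ theta - target) * phi(s, a)
            theta -= lr * grad / len(dataset)
        theta_prev = theta                        # Q_k becomes the target for the next iteration
    return theta_prev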
This gradient allows us to write an online version of dynamic programming, i.e., temporal
difference learning with function approximation.
Gradient Updates
Let $Q$ be parametric, i.e., $Q_\theta(s, a)$ with $\theta \in \mathbb{R}^d$,
where, e.g., $Q_\theta(s, a) = \phi(s, a)^\top \theta$ for a linear parametrization, and $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is a feature map.
4: for Single episode do
5: Sample action $a \sim \pi(\cdot \mid s)$
9: $\theta \leftarrow \theta + \alpha \big( r + \gamma\, Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
10: end for
11: end for
Temporal Difference with Function Approximation
1: Input: learning rate $\alpha$, number of episodes $N$, parameter vector $\theta$
2: for Episodes do
9: $\delta \leftarrow r + \gamma\, Q_\theta(s', a') - Q_\theta(s, a)$
10: $\theta \leftarrow \theta + \alpha\, \delta\, \nabla_\theta Q_\theta(s, a)$
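Putting the pieces together, a compact sketch of the online loop with a linear $Q_\theta$; the env.reset()/env.step() interface and the sampling policy are assumptions for illustration:

import numpy as np

def td_function_approx(env, policy, phi, d, n_episodes=100, alpha=0.05, gamma=0.99):
    # Online (semi-)gradient TD for Q with a parametric Q_theta (here linear).
    theta = np.zeros(d)
    for _ in range(n_episodes):
        s = env.reset()                              # assumed env interface
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)
            # TD error; the bootstrapped target is treated as a constant
            q_next = 0.0 if done else phi(s_next, a_next) @ theta
            delta = r + gamma * q_next - phi(s, a) @ theta
            theta += alpha * delta * phi(s, a)       # gradient of a linear Q_theta is phi(s, a)
            s, a = s_next, a_next
    return theta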
Convergence
Approximate dynamic programming and temporal difference with function
approximation converge to a biased solution when the parametrization is linear.
When the function approximation is not linear (e.g., neural networks), ADP and TD with
function approximation are not guaranteed to converge.
In either case, their estimate is biased. Such bias can be mitigated by introducing many
parameters, but many parameters cause high estimation variance, which can only be
compensated by using many samples.
State-of-the-art reinforcement learning tends to use deep neural networks with millions (or
even billions) of parameters, and a vast amount of samples.
Our Objective
Our objective is to find a policy that performs at least as well as any other policy. In
mathematical terms:
$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \text{for all } s \text{ and all policies } \pi.$
We call such a policy optimal. Note: there might be several optimal policies.
6: end for
7: end for
Question: does this algorithm converge to the optimal policy? Yes (for tabular $V$ and $Q$).
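A minimal tabular policy iteration sketch for a finite MDP, assuming a known transition tensor P[s, a, s'] and reward table R[s, a] (both hypothetical inputs):

import numpy as np

def policy_iteration(P, R, gamma=0.99, n_eval_sweeps=200):
    # P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for Q^pi
        Q = np.zeros((n_states, n_actions))
        for _ in range(n_eval_sweeps):
            V = Q[np.arange(n_states), pi]        # V^pi(s) = Q^pi(s, pi(s))
            Q = R + gamma * P @ V
        # Policy improvement: act greedily with respect to Q^pi
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):            # no change -> converged (tabular case)
            return pi, Q
        pi = pi_new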
Policy Improvement Theorem
Corollary 1. If there exists a policy $\pi'$ such that $Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$ for all $s$, then
$V^{\pi'}(s) \ge V^{\pi}(s)$ for all $s$.
Corollary 2 (Optimality Bellman Equation). The optimal $Q$-function satisfies the following
optimality Bellman equation:
$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q^*(s', a').$
Can we use the optimality Bellman equation to obtain some guarantees on the
convergence of policy iteration, and to derive a more efficient algorithm?
The optimality Bellman operator is contractive and, thanks to Banach's fixed-point theorem, we
can state that iterating it converges to a unique fixed point, the optimal $Q$-function $Q^*$.
Once we obtain the optimal action-value function $Q^*$, obtaining an optimal policy is trivial:
$\pi^*(s) \in \arg\max_a Q^*(s, a)$.
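Written out (the operator notation is introduced here for convenience, not taken from the slides), the optimality Bellman operator and its contraction property read:
\[
(\mathcal{T}^* Q)(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a'),
\qquad
\| \mathcal{T}^* Q_1 - \mathcal{T}^* Q_2 \|_\infty \le \gamma \, \| Q_1 - Q_2 \|_\infty,
\]
so, by Banach's fixed-point theorem, the iteration $Q_{k+1} = \mathcal{T}^* Q_k$ converges to the unique fixed point $Q^*$ from any initialization.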
Value Iteration
3: for $k = 1, \dots, K$ do
4: $Q_k(s, a) \leftarrow r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q_{k-1}(s', a')$ for all $(s, a)$
5: end for
6: for all states $s$ do
7: $\pi(s) \leftarrow \arg\max_a Q_K(s, a)$
8: end for
9: Return $\pi$ and $Q_K$
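The same algorithm as a short sketch, again assuming a known transition tensor P[s, a, s'] and reward table R[s, a]:

import numpy as np

def value_iteration(P, R, gamma=0.99, n_iters=500):
    # P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Optimality Bellman backup: bootstrap with a max over next actions
        Q = R + gamma * P @ Q.max(axis=1)
    pi = Q.argmax(axis=1)          # greedy policy w.r.t. the (near-)optimal Q
    return pi, Q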
Value Iteration
Value iteration is basically dynamic programming with a max operator that selects the
action at the next state.
Value iteration is model-based (it needs knowledge of the transition model and the reward). Similarly to
what we have seen in the previous lecture, we can derive a model-free, online version, called
$Q$-learning.
Online Algorithms
Like in the previous lecture, we want to devise an online algorithm that uses samples to
update the $Q$-function and the policy.
Online Algorithm
1: Initialize a $Q$-function
2: for Episodes do
6: Apply $a$ on the environment and receive reward $r$ and next state $s'$
7: Use $(s, a, r, s')$ to update the $Q$-function
8:
9: end for
10: end for
Q-learning
Like in the previous lecture, we can use online averaging and bootstrapping, i.e.,
$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)$, with $\alpha = 1 / n(s, a)$.
$\varepsilon$-Greedy Policies
We want to obtain an "online" algorithm, i.e., an algorithm that improves the policy while
interacting with the environment.
A popular strategy is to use $\varepsilon$-greedy policies: policies that select the greedy action with
probability $1 - \varepsilon$, and select a random (possibly sub-optimal) action with probability $\varepsilon$.
Such policies select good actions with high probability (usually $\varepsilon$ is small), while
still exploring and avoiding local minima. (For the most curious: see the exploration-
exploitation trade-off.)
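As a one-function sketch (the Q-table layout matches the tabular algorithms below):

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    # With probability 1 - epsilon select the greedy action, otherwise explore uniformly.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # random action
    return int(Q[s].argmax())                  # greedy action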
Q-learning
Tabular Q-learning
1: Input: exploration rate $\varepsilon$, number of episodes $N$
2: Initialize: a table of state-action visitation counts $n(s, a)$, a table of $Q$-values $Q(s, a)$
initialized with zeros
3: for Episodes do
With probability $1 - \varepsilon$ select $a = \arg\max_{a'} Q(s, a')$, otherwise select $a$ randomly.
$n(s, a) \leftarrow n(s, a) + 1$, $\alpha \leftarrow 1 / n(s, a)$ # Learning Rate Update
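A minimal tabular Q-learning sketch that follows the steps above, reusing the epsilon_greedy helper and a 1/n(s, a) learning rate; the env.reset()/env.step() interface is an assumption:

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=1000, gamma=0.99, epsilon=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))      # Q-values initialized with zeros
    n = np.zeros((n_states, n_actions))      # state-action visitation counts
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)      # epsilon-greedy behaviour policy
            s_next, r, done = env.step(a)               # assumed env interface
            n[s, a] += 1
            alpha = 1.0 / n[s, a]                       # learning rate update (online averaging)
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])       # Bellman (Q-learning) update
            s = s_next
    return Q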
Simulation Time
Let's simulate $Q$-learning. We use the "investor MDP" as an example (next slide). There are
three states: rich, well-off, poor.
There are two actions: 0: no-invest, 1: invest.
Tabular SARSA
1: Input: exploration rate $\varepsilon$, number of episodes $N$
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: With probability $1 - \varepsilon$ select $a' = \arg\max_{a''} Q(s', a'')$, otherwise select a random action
$a'$. # $\varepsilon$-Greedy Policy
10: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma\, Q(s', a') - Q(s, a) \big)$ # Bellman Update
11: $s \leftarrow s'$, $a \leftarrow a'$
12: end for
Q-learning
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
7: Update the learning rate
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
Note that the $Q$-learning update bootstraps with $\max_{a'} Q(s', a')$, regardless of which action is actually
executed next. For this reason, SARSA is an on-policy algorithm (i.e., it evaluates the current policy), while
$Q$-learning is off-policy, since it evaluates the greedy policy while using an $\varepsilon$-greedy policy on
the environment.
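Side by side, the two updates differ only in the bootstrap term:
\[
\text{SARSA:}\;\; Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma\, Q(s', a') - Q(s, a)\big),
\qquad
\text{Q-learning:}\;\; Q(s, a) \leftarrow Q(s, a) + \alpha\big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big),
\]
where $a'$ is the action actually selected by the $\varepsilon$-greedy behaviour policy.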
Deep Q-Network
Q-learning with function approximation tends to be a bit unstable.
DQN aims to mitigate those instabilities by 1) introducing a target $Q$-function, and 2)
introducing randomized mini-batch updates (replay buffer).
Target $Q$-functions. To stabilize learning, it is useful to avoid bootstrapping on a constantly
changing $Q$-function. The idea of DQN is to keep two separate functions, as in DP. This can be
done by having two separate sets of parameters $\theta$ and $\theta^-$: $Q_\theta$ is updated at every step,
while $Q_{\theta^-}$ provides the bootstrap targets. The target parameters are updated once in a while,
e.g., $\theta^- \leftarrow \theta$ every $C$ steps.
Replay buffer. In classic $Q$-learning, samples are very correlated, as they are obtained by
running the MDP. Using non-i.i.d. samples with function approximation is problematic. The
idea of the replay buffer is to store the last $N$ samples, and to sample each time a mini-batch
of (approximately) i.i.d. samples from it.
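A compact sketch of the two ingredients, using a linear $Q_\theta$ for brevity (a deep network would take its place); the environment interface and all hyperparameters are placeholders:

import random
from collections import deque
import numpy as np

def dqn_style_training(env, phi, d, n_actions, n_steps=10_000, gamma=0.99, lr=1e-3,
                       epsilon=0.1, batch_size=32, buffer_size=10_000, sync_every=500):
    theta = np.zeros(d)                    # online parameters
    theta_target = theta.copy()            # target parameters, kept fixed between syncs
    buffer = deque(maxlen=buffer_size)     # replay buffer of (s, a, r, s', done) samples

    s = env.reset()                        # assumed env interface
    for t in range(n_steps):
        # epsilon-greedy action from the online Q
        q_s = np.array([phi(s, b) @ theta for b in range(n_actions)])
        a = random.randrange(n_actions) if random.random() < epsilon else int(q_s.argmax())
        s_next, r, done = env.step(a)
        buffer.append((s, a, r, s_next, done))
        s = env.reset() if done else s_next

        if len(buffer) >= batch_size:
            # randomized mini-batch: breaks the correlation of consecutive samples
            batch = random.sample(list(buffer), batch_size)
            grad = np.zeros(d)
            for (bs, ba, br, bs_next, bdone) in batch:
                q_next = max(phi(bs_next, bn) @ theta_target for bn in range(n_actions))
                target = br + gamma * q_next * (not bdone)   # bootstrap on the target parameters
                grad += (phi(bs, ba) @ theta - target) * phi(bs, ba)
            theta -= lr * grad / batch_size

        if t % sync_every == 0:
            theta_target = theta.copy()    # periodic target update
    return theta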
Deep Q-Network
DQN
1: for Episodes do
2: Sample first state $s$
9: