Lecture 10: Q-Learning, Function Approximation, Temporal Difference Learning
Previously, we have considered solving MDPs when the underlying model, i.e., the transition probabilities and the cost
structure, is completely known. If the MDP model is unknown, or if it is known but computationally infeasible
to use directly (except through sampling) due to the domain size, then simulation-based stochastic approximation
algorithms for estimating the optimal policy of MDPs can be employed. Here, “simulation-based” means that the
agent/controller can observe the system trajectory under any choice of control actions and therefore, trajectory sampling
from the MDP is performed. We will study two algorithms in this case:
(1) Q-learning, studied in this lecture: It is based on the Robbins–Monro algorithm (stochastic approximation (SA))
to estimate the value function for an unconstrained MDP. A primal-dual Q-learning algorithm can be employed
for MDPs with monotone optimal policies. The Q-learning algorithm also applies as a suboptimal method for
POMDPs.
(2) Policy gradient algorithms, which we will see in later lectures: Such algorithms rely on parametric policy
classes, e.g., on the class of Gibbs policies. They employ gradient estimation of the cost function together with
a stochastic gradient algorithm on the performance surface induced by the selected smoothly parameterized
policy class M = {µθ : θ ∈ Rd } of stochastic stationary policies to estimate the optimal policy. Policy gradient
algorithms apply to MDPs and constrained MDPs, while they yield suboptimal policy search methods for
POMDPs.
Note: Determining the optimal policy of an MDP (or a POMDP) when the model parameters are unknown corresponds
to a stochastic adaptive control problem. Stochastic adaptive control algorithms are of two types: direct methods,
where the unknown MDP model is estimated simultaneously with updating the control policy, and implicit methods
such as simulation-based methods, where the underlying MDP model is not directly estimated in order to compute the
control policy1 . Q-learning, Temporal Difference (TD) learning and policy gradient algorithms correspond to such
simulation-based methods. Such methods are also called reinforcement learning algorithms.
Reinforcement Learning: Also called neuro-dynamic programming or approximate dynamic programming2 . The first
term is due to the use of neural networks with RL algorithms. Reinforcement learning is a branch of machine learning.
It corresponds to learning how to map situations or states to actions or equivalently to learning how to control a system
in order to minimize or to maximize a numerical performance measure that expresses a long-term objective. The agent
is not told which actions to take, but instead must discover which actions yield the most reward or the least cost by
trying them. Actions may affect the immediate reward or cost and the next situation or state. Thus, actions influence all
subsequent rewards or costs. These two characteristics, i.e., the trial-and-error search and the delayed rewards or costs,
are the two most distinguishing features of reinforcement learning. The main differences of reinforcement learning
from other machine learning paradigms are summarized below:
1 Often in the literature, the terms “direct” and “implicit or indirect” learning are used with reverse associations to methods. With the term “direct
learning” several authors refer to simulation-based methods and the “directness” corresponds to “directly, without estimating an environmental
model”. Similarly, “indirect learning” is used for methods estimating first a model for the environment and then computing an optimal policy via
“certainty equivalence”. As an additional comment of independent interest, it is well-known in adaptive control theory that the certainty equivalence
principle may lead to suboptimal performance due to the lack of exploration.
2 Consider the very rich field known as approximate dynamic programming. Neuro-Dynamic Programming is mainly a theoretical treatment of the
field using the language of control theory. Reinforcement Learning describes the field from the perspective of artificial intelligence and computer
science. Finally, Approximate Dynamic Programming uses the parlance of operations research, with more emphasis on high dimensional problems
that typically arise in this community.
(a) There is no supervisor, only a reward or a cost signal which reinforces certain actions over others.
(b) The feedback is typically delayed.
(c) The data are sequential and therefore, time is critical.
(d) The actions of the agent affect the (subsequent) data generation mechanism.
Simulation-based RL algorithms can be grouped into the following two classes3:
(i) Policy-Iteration based algorithms or “actor-critic” learning: As an initial comment, actor-critic learning is
the (generalized) learning analogue of the Policy Iteration method of Dynamic Programming (DP), i.e., the
corresponding approach that is followed in the context of reinforcement learning due to the lack of knowledge
of the underlying MDP model and possibly due to the use of function approximation if the state-action space
is large. More specifically, actor-critic methods implement Generalized Policy Iteration. Recall that policy
iteration alternates between a complete policy evaluation and a complete policy improvement step. If sample-
based methods or function approximation for the underlying value function or the Q-function4 are employed,
exact evaluation of the policies may require infinitely many samples or might be impossible due to the function
approximation class. Hence, RL algorithms simulating policy iteration must change the policy by relying on
partial or incomplete knowledge of the associated value function. Such schemes are said to implement generalized
policy iteration.
To make the discussion more concrete, recall the Policy Iteration method. For an initial policy µ0 = µ the
following iteration is set:
• Policy Evaluation: Compute Jµk = (I − αPµk)⁻¹ c̄µk and let Qµk(i, u) = c̄(i, u) + α Σ_j Pij(u) Jµk(j).
• Policy Improvement: Update µk to the greedy policy associated with Qµk, i.e., to
  µk+1(i) = arg min_u [c̄(i, u) + α Σ_j Pij(u) Jµk(j)] = arg min_u Qµk(i, u) for all i ∈ X.
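To make the two steps concrete, the following minimal Python sketch implements exact policy iteration for a known finite MDP. The interface (a transition tensor `P`, an expected-cost matrix `c_bar`, a discount `alpha`) is an assumption made for the example, not something defined in these notes.

```python
import numpy as np

def policy_iteration(P, c_bar, alpha, max_iter=1000):
    """Exact policy iteration for a known finite MDP.

    P:     array of shape (U, X, X); P[u, i, j] = P_ij(u)
    c_bar: array of shape (X, U);    c_bar[i, u] = expected stage cost
    alpha: discount factor in (0, 1)
    """
    U, X, _ = P.shape
    mu = np.zeros(X, dtype=int)                    # initial policy mu_0
    for _ in range(max_iter):
        # Policy evaluation: J_mu = (I - alpha * P_mu)^{-1} c_bar_mu
        P_mu = P[mu, np.arange(X), :]              # row i is P_i.(mu(i))
        c_mu = c_bar[np.arange(X), mu]
        J_mu = np.linalg.solve(np.eye(X) - alpha * P_mu, c_mu)
        # Q_mu(i, u) = c_bar(i, u) + alpha * sum_j P_ij(u) J_mu(j)
        Q_mu = c_bar + alpha * np.einsum('uij,j->iu', P, J_mu)
        # Policy improvement: greedy policy with respect to Q_mu
        mu_next = Q_mu.argmin(axis=1)
        if np.array_equal(mu_next, mu):            # policy stable -> optimal
            return mu, J_mu, Q_mu
        mu = mu_next
    return mu, J_mu, Q_mu
```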
We now want to perform the above two steps without access to the true dynamics and reward or cost structure.
Assume that an initial policy µ0 = µ is chosen. The corresponding iteration implemented by an actor-critic
method is subdivided as follows:
• Policy Evaluation (“Critic”): Estimate the value of the current policy of the actor or Qµk by performing a
model-free policy evaluation. This is a value prediction problem.
• Policy Improvement (“Actor”): Update µk based on the estimated Qµk from the previous step.
A comment: As is clear from the previous description, the actor performs policy improvement. This
improvement can be implemented in a similar spirit as in policy iteration by moving the policy towards the
greedy policy underlying the Q-function estimate obtained from the critic. Alternatively, policy gradient can
be performed directly on the performance surface underlying a chosen parametric policy class. Moreover, the
actor performs some form of exploration to enrich the current policy, i.e., to guarantee that all actions are tried.
Roughly speaking, the exploration process guarantees that all state-action pairs are sampled sufficiently often.
Exploration of all actions available at a particular state is important, even if they might be suboptimal with respect
to the current Q-estimate. Moreover, the policy evaluation step produces an estimate of Qµk . Clearly, a point of
vital importance is that the policy improvement step monotonically improves the policy as in the model-based
case.
Methods for policy evaluation (“critic”) include:
• Monte Carlo policy evaluation.
• Temporal Difference methods: TD(λ), SARSA, etc.
3 These classes fall within the more general class of value-function based schemes.
4 or both sample-based methods with function approximation
(ii) Value-Iteration based algorithms: Such approaches are based on some online version of value iteration,
Ĵk+1(i) = min_u [c̄(i, u) + α Σ_j Pij(u) Ĵk(j)], ∀i ∈ X. The basic learning algorithm in this class is Q-learning.
The aim of Q-learning is to approximate the optimal action-value function Q by generating a sequence {Q̂k }k≥0
of such functions. The underlying idea is that if Q̂k is “close” to Q for some k, then the corresponding greedy
policy with respect to Q̂k will be close to the optimal policy µ∗ which is greedy with respect to Q.
The Q-learning algorithm is a widely used model-free reinforcement learning algorithm. It corresponds to the
Robbins–Monro stochastic approximation algorithm applied to estimate the value function of Bellman’s dynamic
programming equation. It was introduced in 1989 by Christopher J. C. H. Watkins in his PhD Thesis. A convergence
proof was presented by Christopher J. C. H. Watkins and Peter Dayan in 1992. A more detailed mathematical proof
was given by John Tsitsiklis in 1994, and by Dimitri Bertsekas and John Tsitsiklis in their book on Neuro-Dynamic
Programming in 1996. As a final comment, although Q-learning is a cornerstone of the RL field, it does not really scale
to large state-control spaces. Large state-control spaces associated with complex problems can be handled by using
state aggregation or approximation techniques for the Q-values.
Recall that Bellman’s dynamic programming equation for a discounted cost MDP is:
J*(i) = min_{u∈U} [c̄(i, u) + α Σ_j Pij(u) J*(j)],  i ∈ X.  (10.1)
Definition 10.1. (Q-function or state-action value function) The (optimal) Q-function is defined by
Q(i, u) = c̄(i, u) + α Σ_j Pij(u) J*(j),  i ∈ X, u ∈ U.  (10.2)
If the Q-function is known, then J*(i) = min_u Q(i, u) and the optimal policy is greedy with respect to Q, i.e.,
µ*(i) = arg min_u Q(i, u). Combining the previous results, we obtain:
Q(i, u) = c̄(i, u) + α Σ_j Pij(u) min_v Q(j, v)
        = c̄(i, u) + α E[min_v Q(xk+1, v) | xk = i, uk = u].  (10.3)
By abusing notation, we will denote by T the operator in the right-hand side of the previous equation. Then, (10.3) can
be alternatively written as:
Q = T (Q) , (10.4)
where
T(Q)(i, u) = c̄(i, u) + α Σ_j Pij(u) min_v Q(j, v)
           = E[c(xk, uk) + α min_v Q(xk+1, v) | xk = i, uk = u].  (10.5)
Theorem 10.2. T is a contraction mapping (with respect to ‖·‖∞) with parameter α < 1, i.e., ‖T(Q1) − T(Q2)‖∞ ≤ α ‖Q1 − Q2‖∞ for all Q1, Q2.
Proof.
(T(Q1) − T(Q2))(i, u) = α Σ_j Pij(u) [min_v Q1(j, v) − min_v Q2(j, v)].
It follows that
|(T(Q1) − T(Q2))(i, u)| ≤ α Σ_j Pij(u) |min_v Q1(j, v) − min_v Q2(j, v)|
                        ≤ α Σ_j Pij(u) max_v |Q1(j, v) − Q2(j, v)|
                        ≤ α Σ_j Pij(u) max_{r,v} |Q1(r, v) − Q2(r, v)|
                        = α max_{r,v} |Q1(r, v) − Q2(r, v)| Σ_j Pij(u)
                        = α ‖Q1 − Q2‖∞,
which leads to ‖T(Q1) − T(Q2)‖∞ ≤ α ‖Q1 − Q2‖∞ since the obtained bound is valid for any (i, u) ∈ X × U. We
further note that the second inequality is due to the fact that for two vectors x and y of equal dimension,
|min_i xi − min_i yi| ≤ max_i |xi − yi|.
Proof. Assume without loss of generality that min_i xi ≥ min_i yi and let i1 = arg min_i yi. Then,
0 ≤ min_i xi − min_i yi ≤ x_{i1} − y_{i1} ≤ max_i |xi − yi|, which proves the claim.
Note: For two vectors x and y of equal dimension, |max_i xi − max_i yi| ≤ max_i |xi − yi|. This inequality is useful in
the case of a problem with rewards. It can be used to show the contraction property of the corresponding T operator
defined as T(Q)(i, u) = r̄(i, u) + α Σ_j Pij(u) max_v Q(j, v) in this case.
Since T is a contraction mapping, we can use value iteration to compute Q(i, u) if the MDP model is known. In
practice, the MDP model is unknown. This problem is addressed via Q-learning.
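For illustration, here is a brief Python sketch of Q-value iteration, i.e., of repeatedly applying the operator T when the model is known. The arrays `P`, `c_bar` and the tolerance are assumptions made for the example.

```python
import numpy as np

def q_value_iteration(P, c_bar, alpha, tol=1e-8, max_iter=100_000):
    """Compute the optimal Q-function by iterating Q <- T(Q).

    P[u, i, j] = P_ij(u), c_bar[i, u] = expected stage cost, 0 < alpha < 1.
    """
    U, X, _ = P.shape
    Q = np.zeros((X, U))
    for _ in range(max_iter):
        # T(Q)(i, u) = c_bar(i, u) + alpha * sum_j P_ij(u) * min_v Q(j, v)
        TQ = c_bar + alpha * np.einsum('uij,j->iu', P, Q.min(axis=1))
        if np.max(np.abs(TQ - Q)) < tol:   # sup-norm stopping rule
            return TQ
        Q = TQ
    return Q
```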
10.1.1 Q-learning
By now, it should be clear that Q-learning means learning the Q-function. In the following, we introduce two versions
of Q-learning: synchronous and asynchronous. Before this, we examine the rationale behind Q-learning and its
connection with the fundamental Robbins-Monro algorithm.
Observe that (10.1) has an expectation inside the minimization, while (10.3) has an expectation outside the minimization.
This crucial observation forms the basis for using stochastic approximation algorithms to estimate the Q-function. Note
that (10.3) can be written as:
E[h(Q)(xk, uk, xk+1) | xk = i, uk = u] = 0,  ∀(i, u) ∈ X × U,  (10.6)
where
h(Q)(xk, uk, xk+1) = c(xk, uk) + α min_v Q(xk+1, v) − Q(xk, uk).  (10.7)
The Robbins-Monro algorithm can be used to estimate the solution of (10.6) via the recursion:
Q̂k+1(xk, uk) = Q̂k(xk, uk) + εk(xk, uk) h(Q̂k)(xk, uk, xk+1),  (10.8)
or
Q̂k+1(xk, uk) = (1 − εk(xk, uk)) Q̂k(xk, uk) + εk(xk, uk) [c(xk, uk) + α min_v Q̂k(xk+1, v)].  (10.9)
Note: As we will see later on, the term h(Q)(xk , uk , xk+1 ) is very similar to the temporal difference in TD schemes
for policy evaluation (specifically, to TD(0)), except for the minimization operation applied to Q̂(xk+1 , v). Defining the
temporal difference to incorporate the minimization (or maximization for rewards) operator, Q-learning corresponds to
an instance of temporal difference learning.
Remark: It turns out that the temporal differences underlying Q-learning do not telescope. This is due to the fact
that Q-learning is an inherently off-policy algorithm. We clarify the notion of off-policy algorithms further after the
deterministic Q-learning example that we provide in the following.
The decreasing stepsize sequences (or sequences of learning rates) {εk(i, u)}k≥0 for any (i, u) ∈ X × U must satisfy
Σ_k εk(i, u) = ∞ and Σ_k εk²(i, u) < ∞ in the context of stochastic approximation. These constraints are also called
Robbins-Monro conditions. A possible choice is
εk(i, u) = ε / Nk(i, u),  (10.10)
where ε > 0 is a constant and Nk(i, u) is the number of times the state-action pair (i, u) has been visited until time k
by the algorithm. The algorithm can be summarized as a two-timescale stochastic approximation scheme.
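A minimal Python sketch of such a two-timescale implementation follows. The simulator interface `step(x, u)` (returning the next state and the incurred cost), the update interval `T_bar` and the exploration probability are assumptions made here for concreteness.

```python
import numpy as np

def two_timescale_q_learning(step, X, U, alpha, T_bar=200, n_policy_updates=50,
                             explore=0.1, seed=0):
    """Two-timescale Q-learning sketch: fast Q-updates, slow policy updates."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((X, U))
    N = np.zeros((X, U))                      # visit counts for the stepsizes
    mu = np.zeros(X, dtype=int)               # current policy (slow time scale)
    x = int(rng.integers(X))
    for _ in range(n_policy_updates):         # slow time scale: policy update
        for _ in range(T_bar):                # fast time scale: Q-function update
            u = mu[x] if rng.random() > explore else int(rng.integers(U))
            x_next, cost = step(x, u)         # sample next state and observe cost
            N[x, u] += 1
            eps = 1.0 / N[x, u]               # stepsize eps_k(i, u), cf. (10.10)
            Q[x, u] += eps * (cost + alpha * Q[x_next].min() - Q[x, u])
            x = x_next
        mu = Q.argmin(axis=1)                 # greedy policy w.r.t. the current Q
    return Q, mu
```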
This two-time scale implementation of Q-learning with policy updates during an infinitely long trajectory appears in
the literature, but it is not the standard form of this algorithm.
Remarks:
1. Fast Time Scale: The Q-function is updated by applying the same policy for a fixed period T̄ of time slots referred
to as the update interval.
2. Slow Time Scale: Policy update.
3. Simulate (or sample) the next state as xk+1 ∼ Pxk,·(uk): The Q-learning algorithm does not require explicit
knowledge of the transition probabilities, but simply access to the controlled system to measure its next state
xk+1 when an action uk is applied. Via the same access to the controlled system, the cost signal c(xk , uk ) is
observed.
4. Under some conditions, the Q-learning algorithm for a finite state MDP converges almost surely to the optimal
solution of Bellman’s equation.
Note: The above implementation explicitly describes how the observed trajectory is generated. Clearly, the
updated policies participate in forming this trajectory.
In the literature, instead of the previous two-time scale form which incorporates policy update steps by considering the
greedy policies with respect to the underlying Q-function estimates at the end of update intervals, Q-learning is usually
presented in the following two variations, where the generation of the underlying system trajectory is not explicitly
specified:
Synchronous Q-learning: At time k + 1, we update the Q-function as
Q̂k+1(i, u) = Q̂k(i, u) + εk [c(i, u) + α min_v Q̂k(j, v) − Q̂k(i, u)]
           = (1 − εk) Q̂k(i, u) + εk [c(i, u) + α min_v Q̂k(j, v)],  ∀(i, u) ∈ X × U.  (10.11)
This version is known as synchronous Q-learning because the update is performed for all state-action pairs (i, u) per
iteration. Here, xk+1 = j when uk = u is applied to state xk = i. In other words, at stage k, a random successor state
xk+1 = xk+1(i, u) is simulated according to the distribution Pi·(u) to implement the update of the Q-value for each
state-action pair (i, u).
Asynchronous Q-learning: Suppose that xk = i. With probability pk we choose uk = arg minu Q̂k (xk , u) and with
probability (1 − pk ) we choose any action uniformly at random. This ensures that all state-action pairs (i, u) are
explored instead of just exploiting the current knowledge of Q̂k . At time k + 1, we update the Q-function as
Q̂k+1(i, u) = (1 − εk(i, u)) Q̂k(i, u) + εk(i, u) [c(i, u) + α min_v Q̂k(j, v)],
where u = uk and j = xk+1, while the remaining entries of Q̂k are left unchanged.
In practice, asynchronous Q-learning is used. Moreover, the term “Q-learning” is reserved almost exclusively for
asynchronous Q-learning.
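A minimal Python sketch of asynchronous Q-learning along a single trajectory is given below; the exploration probability and the assumed simulator `step(x, u)` (returning next state and cost) are placeholders for the example.

```python
import numpy as np

def asynchronous_q_learning(step, X, U, alpha, n_steps=100_000, p_greedy=0.9, seed=0):
    """Asynchronous Q-learning: only the visited pair (x_k, u_k) is updated."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((X, U))
    N = np.zeros((X, U))
    x = int(rng.integers(X))
    for _ in range(n_steps):
        # with probability p_greedy exploit, otherwise explore uniformly at random
        u = int(Q[x].argmin()) if rng.random() < p_greedy else int(rng.integers(U))
        x_next, cost = step(x, u)
        N[x, u] += 1
        eps = 1.0 / N[x, u]                          # Robbins-Monro stepsize
        Q[x, u] = (1 - eps) * Q[x, u] + eps * (cost + alpha * Q[x_next].min())
        x = x_next
    return Q, Q.argmin(axis=1)                       # Q-estimate and greedy policy
```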
Stochastic Approximation Rationale in relevance to (10.4): We want to solve Q = T (Q). Borrowing the idea from
stochastic approximation, we can use the iteration
r̂k+1 = (1 − εk )r̂k + εk (T (r̂k ) + noise),
to solve an equation of the form r = T (r) or T (r) − r = 0.
Policy Selection: For the described synchronous and asynchronous Q-learning schemes, either we (theoretically) wait
until convergence and choose the greedy policy with respect to the limit, or we stop the iteration earlier, at
some Q̂k, and extract the corresponding greedy policy as our control choice.
Deterministic Q-learning example: Consider a deterministic MDP with dynamics and per-stage costs
xk+1 = f(xk, uk),
ck = c(xk, uk).
Let the performance metric for some policy µ be the discounted cost objective:
Jµ(i) = Σ_{k=0}^∞ α^k ck,  x0 = i,  uk = µ(xk).
Then, J*(i) = min_µ Jµ(i), ∀i ∈ X. Moreover, the Q-function is defined in this case as
Q(i, u) = c(i, u) + α J*(f(i, u)).
Clearly, J*(i) = min_u Q(i, u). Moreover, the above definition can be given only in terms of the Q-function as follows:
Q(i, u) = c(i, u) + α min_v Q(f(i, u), v).
The corresponding (deterministic) Q-learning algorithm updates, along an observed trajectory, only the entry of the currently visited state-action pair via
Q̂k+1(xk, uk) = c(xk, uk) + α min_v Q̂k(xk+1, v),
leaving all other entries of Q̂k unchanged.
Note: This algorithm relies on a particular system trajectory. However, no reference on how to choose the actions is
made.
Theorem 10.4. Consider the previous deterministic MDP model and the described Q-learning algorithm for this
scenario. If each state-action pair is visited infinitely often, then Q̂k(i, u) → Q(i, u) as k → ∞ for every state-action pair
(i, u) ∈ X × U.
Proof. Let ∆k = ‖Q̂k − Q‖∞ = max_{i,u} |Q̂k(i, u) − Q(i, u)|. Then, at every time instant k + 1, for the updated pair (i, u) = (xk, uk) we have:
|Q̂k+1(i, u) − Q(i, u)| = |ck + α min_v Q̂k(xk+1, v) − (ck + α min_v Q(xk+1, v))|
                        = α |min_v Q̂k(xk+1, v) − min_v Q(xk+1, v)| ≤ α max_v |Q̂k(xk+1, v) − Q(xk+1, v)| ≤ α ∆k.
Consider now any time interval {k1, k1 + 1, . . . , k2} in which each state-action pair is visited at least once. Then, by
the previous derivation (applied to every pair at the time of its last update in the interval, and noting that the error of
an entry cannot grow between its updates),
∆k2 ≤ α ∆k1.
Since as k → ∞ there are infinitely many such intervals and α < 1, we conclude that ∆k → 0 as k → ∞.
Remark: Guaranteeing that each state-action pair is visited infinitely often ensures that as k → ∞ there are infinitely
many time intervals {k1, k1 + 1, . . . , k2} such that each state-action pair is visited at least once. To ensure this
requirement, some form of exploration is often used. The speed of convergence may depend critically on the efficiency
of the exploration.
Key Aspect: As mentioned before, there is no reference to the policy used to generate the system trajectory, i.e., an
arbitrary policy may be used for this purpose. For this reason, Q-learning is an off-policy algorithm. An alternative
way to explain the term “off-policy learning” is learning a policy by following another policy. A distinguishing
characteristic of off-policy algorithms like Q-learning is the fact that the update rule need not have any relation to the
underlying learning policy generating the system trajectory. The updates of Q̂k (xk , uk ) in the Deterministic Q-learning
algorithm and in (10.8)-(10.9) depend on minv Q̂k (xk+1 , v), i.e., the estimated Q-functions are updated on the basis
of hypothetical actions or more precisely actions other than those actually executed. On the other hand, on-policy
algorithms learn the underlying policy as well. Moreover, on-policy algorithms update value functions strictly on the
basis of the experience gained from executing some (possibly nonstationary) policy.
Q-learning is a fairly simple algorithm to implement. Additionally, it permits the use of arbitrary policies to generate
the training data provided that in the limit, all state-action pairs are visited and therefore are updated infinitely often.
Action sampling strategies to achieve this requirement in closed-loop learning are ε-greedy action selection schemes or
Boltzmann exploration that we provide in the sequel5 . More generally, any persistent exploration method will ensure
the aforementioned target requirement. With appropriate tuning, asymptotic consistency can be achieved.
The Q-learning algorithm can be viewed as a stochastic process to which techniques of stochastic approximation
can be applied. The following proof of convergence relies on an extension of Dvoretzky’s (1956) formulation of the
classical Robbins-Monro (1951) stochastic approximation theory to obtain a class of converging processes involving
the maximum norm.
Theorem 10.5. A random iterative process ∆n+1(x) = (1 − an(x))∆n(x) + bn(x)Fn(x) converges to zero with
probability 1 under the following assumptions:
1. The state space is finite.
2. Σ_n an(x) = ∞, Σ_n an²(x) < ∞, Σ_n bn(x) = ∞, Σ_n bn²(x) < ∞, and E[bn(x) | Fn] ≤ E[an(x) | Fn] uniformly with probability 1.
3. ‖E[Fn(x) | Fn]‖_w ≤ γ ‖∆n‖_w, with γ ∈ (0, 1).
4. Var(Fn(x) | Fn) ≤ C(1 + ‖∆n‖_w)², for some constant C > 0.
Here, Fn = {∆n, ∆n−1, . . . , Fn−1, . . . , an−1, . . . , bn−1, . . .} stands for the history at step n. Fn(x), an(x) and bn(x)
are allowed to depend on the past insofar as the above conditions remain valid. Finally, ‖·‖_w denotes some weighted
maximum norm.
5 We have seen both ε-greedy schemes and Boltzmann exploration in the context of multi-armed bandit problems in a previous lecture file.
Weighted Maximum Norm: Let w = [w1, . . . , wn]^T ∈ R^n be a positive vector. Then, for a vector x ∈ R^n the
following weighted max-norm based on w can be defined:
‖x‖_w = max_{1≤i≤n} |xi| / wi.  (10.12)
This norm induces a matrix norm. For a matrix A ∈ R^{n×n} the following weighted max-norm based on w can be
defined:
‖A‖_w = max {‖Ax‖_w : x ∈ R^n, ‖x‖_w = 1}.  (10.13)
Theorem 10.6. (Convergence of Q-learning) For a finite MDP, the Q-learning recursion
Q̂k+1(xk, uk) = (1 − εk(xk, uk)) Q̂k(xk, uk) + εk(xk, uk) [c(xk, uk) + α min_v Q̂k(xk+1, v)]  (10.14)
converges with probability 1 to the optimal Q-function, i.e., the one with values given by (10.2), as long as
Σ_k εk(x, u) = ∞ and Σ_k εk²(x, u) < ∞, ∀(x, u) ∈ X × U.  (10.15)
Note: Under the usual consideration that εk (xk , uk ) ∈ [0, 1), (10.15) requires that all state-action pairs are visited
infinitely often.
Proof. Let
∆k(x, u) = Q̂k(x, u) − Q(x, u),
where Q(x, u) is given by (10.2). Subtracting Q(xk, uk) from both sides of (10.14), we obtain:
∆k+1(xk, uk) = (1 − εk(xk, uk)) ∆k(xk, uk) + εk(xk, uk) [c(xk, uk) + α min_v Q̂k(xk+1, v) − Q(xk, uk)],  (10.16)
where the bracketed term is Fk(xk, uk), by defining
Fk(x, u) = c(x, u) + α min_v Q̂k(X(x, u), v) − Q(x, u),
where X(x, u) is a random sample state obtained from the Markov chain with state space X and transition matrix
P(u) = [Pij(u)]. Therefore, the Q-learning algorithm has the form of the process in the previous theorem with
an(x) = bn(x) = εn(x, u). It is now easy to see that
E[Fk(x, u) | Fk] = T(Q̂k)(x, u) − Q(x, u),
where T is given by (10.5). Using the fact that T is a contraction mapping and Q = T(Q), we further have
E[Fk(x, u) | Fk] = T(Q̂k)(x, u) − T(Q)(x, u). Therefore,
‖E[Fk(x, u) | Fk]‖∞ = ‖T(Q̂k) − T(Q)‖∞ ≤ α ‖Q̂k − Q‖∞ = α ‖∆k‖∞.
Finally,
Var(Fk(x, u) | Fk) = E[(Fk(x, u) − E[Fk(x, u) | Fk])² | Fk]
                  = E[(c(x, u) + α min_v Q̂k(X(x, u), v) − T(Q̂k)(x, u))² | Fk]
                  = Var(c(x, u) + α min_v Q̂k(X(x, u), v) | Fk) ≤ C(1 + ‖∆k‖∞)²,
where the last step is due to the (at most) linear dependence of the argument on Q̂k(X(x, u), v) and the underlying
assumption that the variance of c(x, u) is bounded. Hence, the conditions of Theorem 10.5 are satisfied (with γ = α and
the maximum norm), and ∆k converges to zero with probability 1, i.e., Q̂k → Q.
Asymptotic Convergence Rate of Q-learning: For discounted MDPs with discount factor 1/2 < α < 1, the asymptotic
rate of convergence of Q-learning is O(1/k^{δ(1−α)}) if δ(1 − α) < 1/2 and O(√(log log k / k)) if δ(1 − α) ≥ 1/2, provided that
the state-action pairs are sampled from a fixed probability distribution. Here, δ = p_min/p_max is the ratio of the minimum and
maximum state-action occupation frequencies.
Remark: In the context of function approximation discussed in subsequent sections, we refer to the so-called ODE
method to show convergence of stochastic recursions. At the end of this file, we provide a brief appendix with a
comment on explicitly looking into Q-learning convergence via the ODE method. For simplicity, we focus there on
the convergence of the synchronous Q-learning algorithm. A similar comment can be made for the asynchronous
Q-learning algorithm as well.
Boltzmann exploration chooses the next action according to the following probability distribution:
p(uk = u | xk = x) = exp(−Q̂k(x, u)/T) / Σ_v exp(−Q̂k(x, v)/T).  (10.17)
T is called temperature. In the literature, Boltzmann exploration is also known as softmax approximation (when rewards
are employed instead of costs and the sign in the exponents is plus instead of minus). In statistical mechanics, the
softmax function is known as the Boltzmann or Gibbs distribution. In general, the softmax function has the following
form,
σ(y; β)_i = exp(βy_i) / Σ_{j=1}^n exp(βy_j),  ∀i = 1, . . . , n,  y = (y_1, . . . , y_n) ∈ R^n,  β > 0.  (10.18)
It is easy to check that the softmax function defines a probability mass function (pmf) over the index set {1, ..., n}.
For β = 0 (or T = ∞), all indices are assigned an equal mass, i.e., the underlying pmf is uniform. As β increases,
indices corresponding to larger elements in y are assigned higher probabilities. In this sense, “softmax” is a smooth
approximation of the “max” function. When β → +∞, we have that
lim_{β→+∞} σ(y; β)_i = lim_{β→+∞} exp(βy_i) / Σ_{j=1}^n exp(βy_j) = 1{i = arg max_j y_j}.
In other words, the softmax function assigns all the mass to the maximum element of y as β → +∞. We note here that
this holds if the maximum element of y is unique, otherwise all the mass is assigned (uniformly) to the set of maxima
of y.
In the form of (10.17), uk tends to the greedy choice arg min_u Q̂k(xk, u) as T → 0, i.e., the policy becomes greedy as
T → 0. (In the form of (10.18), lim_{β→−∞} σ(y; β)_i = 1{i = arg min_j y_j}, if the minimum element of y is unique;
otherwise all the mass is assigned uniformly to the set of minima of y.)
In practice, depending on the learning goal, we may choose a temperature schedule such that Tk → 0 as k → ∞, which
guarantees that arg minv Q̂k (xk , v) is used as control in the limit k → ∞. This corresponds to a decaying exploration
scheme. We can also choose T to have a constant value. Constant temperature Boltzmann exploration corresponds to a
persistent exploration scheme. We discuss decaying and persistent exploration shortly.
In ε-greedy exploration, uk = arg min_v Q̂k(xk, v) is used with probability 1 − ε, and with probability ε an action
chosen uniformly at random is taken. This corresponds to a persistent exploration scheme. The value of ε can be reduced over
time, gradually moving the emphasis from exploration to exploitation. The last scheme is a paradigm of decaying
exploration.
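The two exploration rules can be summarized by the following small sketch; the temperature and ε values used in the usage lines are placeholder choices.

```python
import numpy as np

def epsilon_greedy_action(Q_row, eps, rng):
    """Pick the arg min of the Q-values with prob. 1 - eps, a uniform action otherwise."""
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))
    return int(np.argmin(Q_row))

def boltzmann_action(Q_row, T, rng):
    """Sample an action from the Boltzmann/softmax distribution (minus sign: costs)."""
    z = -np.asarray(Q_row, dtype=float) / T
    z -= z.max()                          # numerical stabilization
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(Q_row), p=p))

rng = np.random.default_rng(0)
Q_row = np.array([1.0, 0.2, 0.7])         # Q-values of the actions at some state
print(epsilon_greedy_action(Q_row, eps=0.1, rng=rng))
print(boltzmann_action(Q_row, T=0.5, rng=rng))
```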
Remarks:
1. Although the above exploration methods are often used in practice, they are local schemes, i.e., they rely on the
current state and therefore convergence is often slow.
2. More broadly, there are schemes with a more global view of the exploration process. Such schemes are designed
to explore “interesting” parts of the state space. We will not pursue such schemes here.
Decaying and Persistent Exploration: We now clarify the difference between decaying exploration and persistent
exploration schemes that exist in the literature. Decaying exploration schemes become over time greedy for choosing
actions, while persistent exploration schemes do not. The advantage of decaying exploration is that the actions taken by
the system may converge to the optimal ones eventually, but with the price that their ability to adapt slows down. On
the contrary, persistent exploration methods can retain their adaptivity forever, but with the price that the actions of the
system will not converge to optimality in the standard sense.
A class of decaying exploration schemes that is of interest in the sequel is defined as follows:
Definition 10.7. (Greedy in the Limit with Infinite Exploration) (GLIE): Consider exploration schemes satisfying
the following properties:
• Each action is visited infinitely often in every state that is visited infinitely often.
• In the limit, the chosen actions are greedy with respect to the learned Q-function with probability 1.
The first condition requires that exploration is performed indefinitely. The second condition requires that the exploration
is decaying over time and the emphasis is gradually passed to exploitation.
ε-greedy GLIE exploration: Let xk = i. Pick the corresponding greedy action with probability 1 − εk(i) and an
action at random with probability εk(i). Here, εk(i) = ν/Nk(i), 0 < ν < 1, and Nk(i) is the number of visits to the
current state xk = i by time k.
Boltzmann GLIE exploration: Let xk = x. Choose uk = u with probability
p(uk = u | xk = x) = exp(−βk(x) Q̂k(x, u)) / Σ_v exp(−βk(x) Q̂k(x, v)),
where
βk(x) = log Nk(x) / Ck(x)  and  Ck(x) ≥ max_{v,v′} |Q̂k(x, v) − Q̂k(x, v′)|.
10.4 SARSA
As we already mentioned, in Q-learning, the policy used to estimate Q is irrelevant as long as the state-action space is
adequately explored. In particular, to obtain Q̂k+1 (xk , uk ) we use minv Q̂k (xk+1 , v) in the Q-update instead of the
actual control used in the next step. SARSA (State–Action–Reward–State–Action) is another scheme for updating the
Q-function. The corresponding update in this case is:
Q̂k+1(xk, uk) = (1 − εk(xk, uk)) Q̂k(xk, uk) + εk(xk, uk) [c(xk, uk) + α Q̂k(xk+1, uk+1)]  (10.20)
and Q̂k+1(x, u) = Q̂k(x, u) for all (x, u) ≠ (xk, uk). Note that Q̂k(xk+1, uk+1) replaces min_v Q̂k(xk+1, v) in (10.9).
This scheme also converges if the greedy policy uk+1 = arg minv Q̂k+1 (xk+1 , v) is chosen as k → ∞. In this case,
due to this convergence,
lim_{k→∞} min_v Q̂k+1(xk+1, v) = lim_{k→∞} min_v Q̂k(xk+1, v)
for any fixed state xk+1 = j and therefore, SARSA and Q-learning iterations will coincide in the limit.
Note: SARSA and Q-learning iterations coincide if a greedy learning policy is applied (i.e., always the greedy action
with respect to the current Q-estimate is chosen).
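For comparison with the Q-learning sketch given earlier, here is a minimal SARSA implementation of the on-policy update (10.20); the simulator `step(x, u)` and the ε-greedy behaviour policy are assumptions of the example.

```python
import numpy as np

def sarsa(step, X, U, alpha, n_steps=100_000, eps_explore=0.1, seed=0):
    """SARSA: the update uses the action u_{k+1} actually taken at x_{k+1}."""
    rng = np.random.default_rng(seed)

    def behaviour(Q_row):
        if rng.random() < eps_explore:
            return int(rng.integers(U))
        return int(np.argmin(Q_row))

    Q = np.zeros((X, U))
    N = np.zeros((X, U))
    x = int(rng.integers(X))
    u = behaviour(Q[x])
    for _ in range(n_steps):
        x_next, cost = step(x, u)
        u_next = behaviour(Q[x_next])               # action actually taken next
        N[x, u] += 1
        eps = 1.0 / N[x, u]
        # cf. (10.20): the target uses Q(x_{k+1}, u_{k+1}), not min_v Q(x_{k+1}, v)
        Q[x, u] = (1 - eps) * Q[x, u] + eps * (cost + alpha * Q[x_next, u_next])
        x, u = x_next, u_next
    return Q
```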
Remarks:
1. SARSA is an on-policy scheme because in the update (10.20) the action uk+1 that was taken according to
the underlying policy is used. Compare this update with the off-policy Q-learning update in (10.9), where the
Q-values are updated using the greedy action corresponding to minv Q̂k (xk+1 , v).
2. As we will see later on, when the underlying policy µ is fixed, SARSA is equivalent to TD(0) applied to
state-action pairs.
3. The rate of convergence of SARSA coincides with the rate of convergence of TD(0) and is O(1/√k). This
is due to the fact that these algorithms are standard linear stochastic approximation methods. Nevertheless,
the corresponding constant in this rate of convergence will be heavily influenced by the choice of the stepsize
sequence, the underlying MDP model and the discount factor α.
4. SARSA can be extended to a multi-step version known as SARSA(λ). The notion of a multi-step extension will
be clarified later on, in our discussion on TD learning and specifically on the TD(λ) algorithm. We will not
pursue this concept any further here.
Convergence of SARSA: As mentioned earlier, for off-policy methods like Q-learning, the only requirement for
convergence is that each state-action pair is visited infinitely often. For on-policy schemes like SARSA which learn the
Q-function of the underlying learning policy, exploration has to eventually become small to ensure convergence to the
optimal policy6 .
Theorem 10.8. In finite state-action MDPs, the SARSA estimate Q̂k converges to the optimal Q-function and the
underlying learning policy µk converges to an optimal policy µ∗ with probability 1 if the exploration scheme (or
equivalently the learning policy) is GLIE and the following additional conditions hold:
1. The learning rates satisfy 0 ≤ εk(i, u) ≤ 1, Σ_k εk(i, u) = ∞, Σ_k εk²(i, u) < ∞ and εk(i, u) = 0 unless
(xk, uk) = (i, u).
2. Var(c(i, u)) < ∞ (or Var(r(i, u)) < ∞ for a problem with rewards).
3. The controlled Markov chain is communicating: every state can be reached from any other with positive
probability (under some policy).
For reference, recall the following classification of controlled Markov chains:
• Recurrent or Ergodic: The Markov chain corresponding to every deterministic stationary policy consists of a
single recurrent class.
• Unichain: The Markov chain corresponding to every deterministic stationary policy consists of a single recurrent
class plus a possibly empty set of transient states.
• Communicating: For every pair of states (i, j) ∈ X × X, there exists a deterministic stationary policy under
which j is accessible from i.
10.5 Q-approximation
Consider finite state-action spaces. If the size of the table representation of the Q-function is very large, then the
Q-function can be appropriately approximated to cope with this problem. Function approximation can be also applied
in the case of infinite state-action spaces.
Generic (Value) Function Approximation: To demonstrate some ideas, we first consider function approximation
for a generic function J : X → R. Let J = {Jθ : θ ∈ R^K} be a family of real-valued functions on the state space
X. Suppose that any function in J is a linear combination of a set of K fixed linearly independent (basis) functions
φr : X → R, r = 1, 2, . . . , K. More explicitly, for θ ∈ R^K,
Jθ(x) = Σ_{r=1}^K θr φr(x) = φ^T(x) θ.
A usual assumption is that ‖φ(x)‖₂ ≤ 1 uniformly on X, which can be achieved by normalizing the basis functions.
Compactly,
Jθ = Φθ,
6 For example, SARSA is often implemented in practice with ε-greedy exploration for a diminishing ε (ε-greedy GLIE exploration).
where Jθ = [Jθ(1), . . . , Jθ(X)]^T and Φ is the X × K matrix whose x-th row is the (transposed) feature vector φ^T(x).
The components φr(x) of the vector φ(x) are called features of state x and Φ is called a feature extraction matrix. Φ has to be
judiciously chosen based on our knowledge about the problem.
Feature Selection: In general, features can be chosen in many ways. Suppose that X ⊂ R. We can then use a
polynomial, Fourier or wavelet basis up to some order. For example, a polynomial basis corresponds to choosing
φ(x) = [1, x, x2 , . . . , xK−1 ]T . Clearly, the set of monomials {1, x, x2 , . . . , xK−1 } is a linearly independent set and
forms a basis for the vector space of all polynomials with degree ≤ K − 1. A different option is to choose an
orthogonal system of polynomials, e.g., Hermite, Laguerre or Jacobi polynomials, including the important special cases
of Chebyshev (or Tchebyshev) polynomials and Legendre polynomials.
If X is multi-dimensional, then the tensor product construction is a commonly used way to construct features. To
elaborate, let X ⊂ X1 × X2 × · · · × Xn and φi : Xi → R^{ki}, i = 1, 2, . . . , n. Then, the tensor product φ = φ1 ⊗ φ2 ⊗
· · · ⊗ φn will have ∏_{i=1}^n ki components for a given state vector x, which can be indexed using multi-indices of the form
(i1, i2, . . . , in), 1 ≤ ir ≤ kr, r = 1, 2, . . . , n. With this notation, φ(i1,i2,...,in)(x) = φ1,i1(x1) φ2,i2(x2) · · · φn,in(xn). If
X ⊂ Rn , then a popular choice for {φ1 , φ2 , . . . , φn } are the Radial Basis Function (RBF) networks
φi(xi) = [G(|xi − xi^(1)|), . . . , G(|xi − xi^(ki)|)]^T,  xi ∈ R, with centers xi^(r), r = 1, 2, . . . , ki,
where G is some user-determined function, often a Gaussian function of the form G(x) = exp(−γx²) for some
parameter γ > 0. Moreover, kernel smoothing is also an option for some appropriate choice of points x^(i), i =
1, 2, . . . , K:
Jθ(x) = Σ_{i=1}^K θi G̃^(i)(x),  where  G̃^(i)(x) = G(‖x − x^(i)‖) / Σ_{j=1}^K G(‖x − x^(j)‖).
Here, G̃^(i)(x) ≥ 0, ∀x ∈ X and ∀i ∈ {1, 2, . . . , K}. Due to Σ_{i=1}^K G̃^(i)(x) = 1 for any x ∈ X, Jθ is an example
of an averager. Averager function approximators are non-expansive mappings, i.e., θ → Jθ is non-expansive in the
maximum norm. Therefore, such approximators are suitable for use in reinforcement learning, because they can be
nicely combined with the contraction mappings in this framework.
The previous brief reference to feature selection methods clearly does not exhaust the list of such methods. In general,
many methods for this purpose are available. The interested reader is referred to the relevant literature.
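A small sketch of two of the feature constructions mentioned above (a one-dimensional polynomial basis and a Gaussian RBF vector); the centers and the parameter γ are arbitrary choices for illustration.

```python
import numpy as np

def polynomial_features(x, K):
    """phi(x) = [1, x, x^2, ..., x^(K-1)] for scalar x."""
    return np.array([x ** r for r in range(K)], dtype=float)

def rbf_features(x, centers, gamma=1.0):
    """phi(x) = [G(|x - x^(1)|), ..., G(|x - x^(k)|)] with G(d) = exp(-gamma * d^2)."""
    centers = np.asarray(centers, dtype=float)
    return np.exp(-gamma * (x - centers) ** 2)

# J_theta(x) = phi(x)^T theta for either feature map, e.g.:
theta = np.array([0.5, -1.0, 0.25])
print(polynomial_features(2.0, K=3) @ theta)
print(rbf_features(2.0, centers=[0.0, 1.0, 2.0]) @ theta)
```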
Q-function approximation: In a similar spirit as before, consider basis functions of the form φr : X × U → R,
r = 1, 2, . . . , K. Now φr (x, u), r = 1, 2, . . . , K correspond to the features of a given state-action pair (x, u). For
θ ∈ R^K,
Qθ(x, u) = Σ_{r=1}^K θr φr(x, u) = φ^T(x, u) θ.
Compactly,
Qθ = Φθ,
where Qθ = [Qθ(1, 1), . . . , Qθ(X, U)]^T and Φ is the XU × K matrix whose rows are the (transposed) feature vectors φ^T(x, u).
Under the described function approximation approach, the problem of learning an optimal policy can be formalized in
terms of learning θ. If K ≪ |X × U| = XU, then we have to learn a much smaller vector θ than the Q-vector.
Consider first the following approximate Q-Value Iteration algorithm for a known MDP model. Starting from some θ̂0, at each iteration t:
• Compute the target Q̃t+1 = T(Φθ̂t), i.e., Q̃t+1(x, u) = c̄(x, u) + α Σ_j Pxj(u) min_v φ^T(j, v)θ̂t for all (x, u) ∈ X × U.
• Solve the weighted least squares problem
min_θ (1/2) Σ_{(x,u)} w(x, u) (φ^T(x, u)θ − Q̃t+1(x, u))² = min_θ (1/2) ‖Φθ − Q̃t+1‖²_W
and set θ̂t+1 to the minimizer, where w(x, u) are some positive weights and W = diag(w(1, 1), . . . , w(X, U)). We note here that ‖x‖²_W =
x^T W x and W is a symmetric positive definite matrix. Moreover, ‖x‖_W = ‖W^{1/2} x‖₂, where W^{1/2} is the square
root of W and ‖·‖₂ is the ℓ2-norm7.
• Repeat.
Compact Version: More compactly, the above algorithm can be written as:
θ̂t+1 = arg min_θ (1/2) ‖Φθ − T(Φθ̂t)‖²_W,  t = 0, 1, 2, . . .  (10.21)
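In matrix form, each iteration of (10.21) is a weighted least squares problem with the closed-form solution θ̂t+1 = (Φ^T W Φ)⁻¹ Φ^T W T(Φθ̂t). A minimal sketch, with the feature matrix, the known-model operator and the weights as assumed inputs:

```python
import numpy as np

def approximate_q_value_iteration(Phi, T_op, w, theta0, n_iters=200):
    """Projected Q-value iteration: theta <- argmin ||Phi theta - T(Phi theta)||_W^2.

    Phi:  (n_pairs, K) feature matrix (one row per state-action pair)
    T_op: callable mapping a Q-vector (length n_pairs) to T(Q) (known model)
    w:    positive weights defining W = diag(w)
    """
    W = np.diag(w)
    # Weighted least squares projection matrix (Phi^T W Phi)^{-1} Phi^T W
    proj = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = proj @ T_op(Phi @ theta)    # theta_{t+1} = proj * T(Phi theta_t)
        # no convergence guarantee in general; see the example that follows
    return theta
```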
The following example aims to show that the choice of the weighted norm k · kW is critical for guaranteeing convergence
of the above scheme.
10.5.2.1 Example
Consider a two-state MDP with no cost and let µ be a stationary policy. The corresponding Markov chain8 is shown in Fig. 10.1: from either state, the chain moves to state 1 with probability ε and to state 2 with probability 1 − ε.
7 The notation ‖·‖_{W,2} is sometimes used for this norm to differentiate it, e.g., from definitions of weighted max-norms as in (10.12). Nevertheless,
we simplify notation by discarding the subscript “2”. We work exclusively with this weighted ℓ2-norm for the rest of this file.
8 A stationary policy and an MDP induce a Markov reward process. A more proper term in our case is a Markov cost process.
We suppress the action u from the notation of all the involved functions in the sequel, since the policy µ is assumed to be
fixed and therefore u = µ(i) when at state i. In particular, the operator of interest becomes
T(Q)(i) = c(i) + α Σ_j Pij Q(j).  (10.22)
Since the underlying MDP model is known for this problem, we can implement the previously described algorithm for
the operator T given by (10.22). First, for each node of the Markov chain we have:
T(Q̂)(1) = c(1) + α Σ_j P1j Q̂(j) = αε Q̂(1) + α(1 − ε) Q̂(2),
T(Q̂)(2) = c(2) + α Σ_j P2j Q̂(j) = αε Q̂(1) + α(1 − ε) Q̂(2),
since c(1) = c(2) = 0. Compactly:
T(Q̂) = (T(Q̂)(1), T(Q̂)(2))^T = A Q̂,  where  A = [αε  α(1−ε); αε  α(1−ε)]  and  Q̂ = (Q̂(1), Q̂(2))^T.  (10.24)
Since there is no cost associated with this problem, we can easily conclude that Q(1) = Q(2) = 0, where Q corresponds
to the (optimal) Q-function and Q(i) = Q(i, µ∗ (i)), i = 1, 2 with µ∗ being an optimal policy. Moreover, at any state
any action is optimal and therefore µ is an optimal policy. Consider the function class:
Q = {Φθ : θ ∈ R},  where Φ = [1, 2]^T, i.e., Qθ(1) = θ and Qθ(2) = 2θ.
If the standard Euclidean norm is used for projection, the projection objective becomes:
min_θ (1/2) ‖Φθ − Q̃t+1‖²₂ = min_θ { (1/2)[θ − θ̂t(αε + 2α(1 − ε))]² + (1/2)[2θ − θ̂t(αε + 2α(1 − ε))]² },  (10.26)
which corresponds to the exact form of (10.21) for this particular problem. Setting the derivative with respect to θ to
zero yields the recursion θ̂t+1 = α̃ θ̂t, with α̃ = (3/5)(αε + 2α(1 − ε)) = 3α(2 − ε)/5.  (10.27)
If α̃ > 1, then the algorithm diverges (unless θ̂0 = 0). Clearly, this can happen for many α and ε (e.g., α = 0.9,
ε = 0.1). Fortunately, using a different weighted norm, we can guarantee convergence for any α, ε.
Let
W = diag(w1, w2)  (10.28)
for some positive constants w1, w2. Using ‖·‖_W based on such a diagonal matrix in the projection objective, we have:
min_θ (1/2) ‖Φθ − Q̃t+1‖²_W = min_θ { (w1/2)[θ − θ̂t(αε + 2α(1 − ε))]² + (w2/2)[2θ − θ̂t(αε + 2α(1 − ε))]² }.  (10.29)
Differentiating with respect to θ and setting the derivative to 0, we obtain the recursion:
θ̂t+1 = α̃ θ̂t,  with  α̃ = (αε + 2α(1 − ε)) (w1 + 2w2)/(w1 + 4w2).
Idea: Choose the weights w1, w2 as the entries of the stationary distribution of the Markov chain in Fig. 10.1.
The stationary distribution of the Markov chain can be obtained by solving the equation π = πP using the constraint
Σ_i πi = 1. π = πP yields in this case (1 − ε)π1 = επ2. Furthermore, π1 + π2 = 1. Therefore, π1 = ε and π2 = 1 − ε.
We now let w1 = π1 and w2 = π2. The update rule can be rewritten as:
θ̂t+1 = α̃ θ̂t,  with  α̃ = α(2 − ε) (ε + 2(1 − ε))/(ε + 4(1 − ε)) = α(2 − ε)²/(4 − 3ε).
Clearly, α̃ < 1 for any α, ε in this case and therefore, the algorithm converges. This suggests that one should perhaps
use W = diag(π) to define the projection objective, where π is the stationary distribution of the Markov chain for the
underlying stationary policy (assuming that such an invariant measure exists).
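The effect of the two weightings can be checked numerically. The sketch below computes the contraction factor α̃ of the iteration θ̂t+1 = α̃ θ̂t under the unweighted and the π-weighted projections for sample values of α and ε, following the formulas derived above.

```python
alpha, eps = 0.9, 0.1
a = alpha * (2 - eps)                    # = alpha*eps + 2*alpha*(1 - eps)

# Unweighted (Euclidean) projection: theta_{t+1} = (3/5) * a * theta_t
alpha_tilde_euclid = 3 * a / 5

# pi-weighted projection with weights (w1, w2) = (eps, 1 - eps):
w1, w2 = eps, 1 - eps
alpha_tilde_pi = a * (w1 + 2 * w2) / (w1 + 4 * w2)   # = alpha*(2-eps)^2/(4-3*eps)

print(alpha_tilde_euclid)   # > 1 here: the unweighted iteration diverges
print(alpha_tilde_pi)       # < 1 for any alpha < 1, eps in (0, 1): convergence
```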
10.5.2.2 Algorithm
θ̂t+1 = arg min_θ (1/2) ‖Φθ − T(Φθ̂t)‖²_π,  t = 0, 1, 2, . . .  (10.32)
for some initial θ̂0. Here, we abuse notation to write9 ‖·‖_π for ‖·‖_diag(π). Also, Q̂ = [Q̂(i)] = [φ^T(i)θ̂].
9 and of course we recall that this is a simplified notation for ‖·‖_{π,2}, which differs from the weighted max-norm in (10.12).
Projection operator: For a vector z, consider the problem min_{Q̄ = Φθ, θ ∈ R^K} (1/2)‖Q̄ − z‖²_π and denote its solution by Q̄ = Π(z). Then, the algorithm given by (10.32) becomes
Q̂t+1 = Π(T(Q̂t)).  (10.33)
We will show that Π(T(·)) is a contraction mapping in ‖·‖_π, assuming that we know the underlying MDP model.
Assume now that the MDP model is unknown. Consider for the moment (10.32) and set
C(θ; Q̃t+1) = (1/2) ‖Φθ − Q̃t+1‖²_π,  where Q̃t+1 = T(Φθ̂t).
Note: In practice, to evaluate the estimated Q-function at all state-action pairs, ut = µ(xt) is combined with some
exploration process. The recursion can then be modified to account for ut, using ut+1 = µ(xt+1) in an online operation.
The ODE method, roughly speaking, states that if the ODE
θ̇ = −f(θ)
converges to θ* as t → ∞, then the corresponding stochastic difference equation also converges to θ* almost surely. In
our case, focusing again on a stationary policy µ, the relevant choice is
f(θ) = E[(φ^T(xt)θ − c(xt) − α φ^T(xt+1)θ) φ(xt)],
where the expectation is taken in steady state (cf. (10.37)). The associated stochastic recursion is (10.38). Here, we
have suppressed ut, ut+1 from the notation, since ut = µ(xt), ut+1 = µ(xt+1). Writing f(θ) more explicitly, we have:
f(θ) = Σ_i πi (φ^T(i)θ − c(i) − α Σ_j Pij φ^T(j)θ) φ(i) = Σ_i πi (φ^T(i)θ − T(Φθ)(i)) φ(i).
Now the problem becomes finding the equilibrium θ* which satisfies f(θ*) = 0. In matrix form, f(θ) = Φ^T D(Φθ − T(Φθ)),
where D = diag(π), so this equation reads:
Φ^T DΦ θ* − Φ^T D T(Φθ*) = 0
or
θ* = (Φ^T DΦ)⁻¹ Φ^T D T(Φθ*).  (10.40)
Does the fixed point equation in (10.40) have a unique solution? Indeed, we are going to show that there is a unique
solution θ∗ to this equation.
Proof. We already know that J = Π(T(J)) has a unique solution because Π(T(·)) is a contraction mapping. Recall
that Π(z) solves
min_{θ: y=Φθ} (1/2) ‖y − z‖²_π  (10.41)
or
min_θ (1/2) ‖Φθ − z‖²_π
or
min_θ (1/2) (Φθ − z)^T D(Φθ − z).  (10.42)
Let S(θ) = (1/2)(Φθ − z)^T D(Φθ − z). By setting the gradient of S(θ) to zero, we have
∇θ S(θ) = 0
or Φ^T D(Φθ − z) = 0
or θ = (Φ^T DΦ)⁻¹ Φ^T D z.  (10.43)
Multiplying both sides of the last equation by Φ and using the facts that z = T(J) and J = Φθ, we obtain:
J = Φ(Φ^T DΦ)⁻¹ Φ^T D T(J) = Π(T(J)).  (10.44)
Because J = Π(T(J)) has a unique solution, i.e., equation (10.44) has a unique solution J* = Φθ*, the fixed point
equation
θ = (Φ^T DΦ)⁻¹ Φ^T D T(Φθ)  (10.45)
has a unique solution θ* (due to the implicit assumption that Φ is full column rank).
Next, we will prove that the ODE θ̇ = −f(θ) converges to θ* as t → ∞ using an appropriate Lyapunov function;
consider L(θ) = (1/2)‖θ − θ*‖²₂. Since f(θ*) = 0,
θ̇ = −f(θ) = −f(θ) + f(θ*) = −Φ^T DΦ(θ − θ*) + Φ^T D(T(Φθ) − T(Φθ*)).  (10.48)
L̇(θ) = (∇θ L(θ))^T dθ/dt = (∇θ L(θ))^T θ̇
      = (∇θ L(θ))^T [−f(θ) + f(θ*)]
      = (θ − θ*)^T [−Φ^T DΦ(θ − θ*) + Φ^T D(T(Φθ) − T(Φθ*))]
      = −(θ − θ*)^T Φ^T DΦ(θ − θ*) + (θ − θ*)^T Φ^T D(T(Φθ) − T(Φθ*))
      = −‖Φ(θ − θ*)‖²_π + (θ − θ*)^T Φ^T D^{1/2} D^{1/2}(T(Φθ) − T(Φθ*)).  (10.50)
By the Cauchy–Schwarz inequality and the contraction property of T in ‖·‖_π for the fixed policy µ (with modulus α, since
Pµ is non-expansive in ‖·‖_π when π is its stationary distribution), the second term is at most α‖Φ(θ − θ*)‖²_π. As a consequence:
L̇(θ) ≤ −(1 − α)‖Φ(θ − θ*)‖²_π ≤ 0,  (10.52)
with L̇(θ) < 0 for any θ ≠ θ* and L̇(θ) = 0 only when θ = θ*. Therefore, we conclude that θ̇ = −f(θ) converges to θ*. Thus, we also
conclude that, under some conditions, the associated stochastic recursion (10.38) converges to θ* almost surely as
t → ∞.
We have just shown that our approximation scheme above can obtain the Q-function for a fixed policy µ at the state-action
pairs (i, µ(i)). Therefore, for an initial policy µ0, our algorithm can proceed as follows to find an optimal policy:
1. Using the previous scheme with exploration, estimate the Q-function for all state-action pairs when the underlying
stationary policy used is µk . Let Q̂ be the obtained estimate.
2. Next, update the policy to the greedy selection µk+1 (x) = arg minu Q̂(x, u).
3. Repeat.
The above scheme corresponds to an actor-critic algorithm, which performs policy iteration with function approximation.
Consider again the function class Q = {Qθ = Φθ : θ ∈ RK } for some feature extraction. Then, the Q-learning
algorithm with function approximation is formalized as:
θ̂t+1 = θ̂t + εt [c(xt, ut) + α min_v φ^T(xt+1, v)θ̂t − φ^T(xt, ut)θ̂t] φ(xt, ut).
This iteration coincides with (10.38) when exploration is implemented, except for the definition of the temporal
difference inside the parentheses which includes the minimization minv φT (xt+1 , v)θ̂t (off-policy characteristic)
instead of φT (xt+1 , ut+1 )θ̂t .
Note: Q-learning is not guaranteed to converge when combined with function approximation.
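A sketch of this recursion (Q-learning with a linear architecture) follows. The feature map `phi(x, u)`, the simulator `step`, and the exploration rule are assumptions of the example, and, as the preceding note warns, convergence is not guaranteed in general.

```python
import numpy as np

def q_learning_linear_fa(step, phi, X, U, K, alpha, n_steps=100_000,
                         eps_explore=0.1, step_size=0.01, seed=0):
    """Q-learning with linear function approximation Q_theta(x, u) = phi(x, u)^T theta.

    phi(x, u) -> feature vector of length K; step(x, u) -> (x_next, cost).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(K)

    def q_value(x, u):
        return phi(x, u) @ theta

    x = int(rng.integers(X))
    for _ in range(n_steps):
        if rng.random() < eps_explore:
            u = int(rng.integers(U))
        else:
            u = int(np.argmin([q_value(x, v) for v in range(U)]))
        x_next, cost = step(x, u)
        # temporal difference with the minimization (off-policy target)
        td = cost + alpha * min(q_value(x_next, v) for v in range(U)) - q_value(x, u)
        theta = theta + step_size * td * phi(x, u)
        x = x_next
    return theta
```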
In this section we discuss Monte Carlo policy evaluation and Temporal Difference learning, which are model-free
policy evaluation methods. Policy evaluation algorithms estimate the value function Jµ or Qµ for some given policy µ.
Many problems can be cast as value prediction problems. Estimating the probability of some future event, the expected
time until some event occurs and Jµ or Qµ underlying some policy µ in an MDP are all value prediction problems. In
the context of actor-critic learning and policy evaluation, these are on-policy algorithms, and the considered policy µ
is assumed to be stationary or approximately stationary. Monte-Carlo methods are based on the simple idea of using
sample means to estimate the average of a random quantity. It turns out that the variance of these estimators can be high
and therefore, the quality of the underlying value function estimates can be poor. Moreover, Monte Carlo methods in
closed-loop estimation usually introduce bias. In contrast, TD learning can address these issues. As a final note, TD
learning was introduced by Richard S. Sutton in 1988 and it is widely considered as one of the most influential ideas in
reinforcement learning.
Monte Carlo methods learn from complete episodes of experience without bootstrapping, i.e., without using current value
estimates to form the update targets. The corresponding idea is described as follows. Suppose that µ is a stationary policy. Assume that we
wish to evaluate the value function in the discounted scenario,
Jµ(i) = E^µ[Σ_{k=0}^∞ α^k c(xk, uk) | x0 = i],
or in an episodic scenario,
Jµ(i) = E^µ[Σ_{k=0}^N c(xk, uk) | x0 = i],
where the horizon N is a random variable. To evaluate the performance, multiple trajectories of the system can be
simulated using policy µ starting from arbitrary initial conditions up to termination or stopped earlier (truncation of
trajectories). Consider a particular trajectory. After visiting state i at time ti , Jµ (i) is estimated as
Ĵµ(i) = Σ_{k=ti}^{Ñ} α^{k−ti} c(xk, uk)  or  Ĵµ(i) = Σ_{k=ti}^{Ñ} c(xk, uk)
for the discounted or the episodic problem, respectively. Here, Ñ is the terminal time instant of the corresponding trajectory. For
the discounted cost problem, Ñ can be chosen large enough to guarantee a small error. Suppose further that we have M
such estimates Jˆµ,1 (i), Jˆµ,2 (i), . . . , Jˆµ,M (i). Then, Jˆµ (i) is finally estimated as
Ĵµ(i) = (1/M) Σ_{r=1}^M Ĵµ,r(i).
• Jˆµ (i) is a single estimate of Jµ (i) if the trajectory starts at i and Jˆµ (i) assumes summation up to Ñ . Clearly,
such an approach requires multiple trajectories to be initialized at every state i ∈ X to obtain M estimates
Jˆµ,1 (i), Jˆµ,2 (i), . . . , Jˆµ,M (i) for every i.
• We can obtain an estimate Jˆµ,r (i) each time state i is visited on the same trajectory. Every time the corresponding
summation will be taken up to Ñ . This approach implies that we can obtain multiple estimates Jˆµ,r (i) per
trajectory and across trajectories. Clearly, value estimates of this form from the same trajectory are heavily
correlated.
• A single estimate Jˆµ,r (i) can be at most obtained from any trajectory by performing a summation up to termination
starting at the first visit of i in the course of a particular trajectory.
Result: Ĵµ(i) → Jµ(i) as M → ∞, ∀i ∈ X, for all the above schemes if each state i is visited sufficiently often. This depends
on the choice of µ and the initial points of the trajectories.
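A minimal sketch of first-visit Monte Carlo policy evaluation for the discounted problem; the simulator `step`, the policy array `mu` and the truncation horizon are assumptions of the example.

```python
import numpy as np

def mc_policy_evaluation(step, mu, X, alpha, n_trajectories=1000, horizon=200, seed=0):
    """First-visit Monte Carlo estimate of J_mu for a discounted cost problem.

    step(x, u) -> (x_next, cost); mu[x] is the stationary policy; the horizon
    truncates the discounted sum so that the truncation error is small.
    """
    rng = np.random.default_rng(seed)
    returns_sum = np.zeros(X)
    returns_cnt = np.zeros(X)
    for _ in range(n_trajectories):
        x = int(rng.integers(X))                # arbitrary initial condition
        states, costs = [], []
        for _ in range(horizon):
            x_next, cost = step(x, mu[x])
            states.append(x)
            costs.append(cost)
            x = x_next
        # accumulate the discounted tail return at the first visit of each state
        G, first_visit_return = 0.0, {}
        for t in range(horizon - 1, -1, -1):    # backward pass over the trajectory
            G = costs[t] + alpha * G
            first_visit_return[states[t]] = G   # overwritten until the earliest visit
        for s, g in first_visit_return.items():
            returns_sum[s] += g
            returns_cnt[s] += 1
    return returns_sum / np.maximum(returns_cnt, 1)
```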
Consider an infinite horizon discounted cost problem and let µ be a stationary policy. Recall the definition of the Tµ operator
defined in earlier lectures:
Tµ(J)(i) = E[c(x0, µ(x0)) + α Σ_j P_{x0 j}(µ(x0)) J(j) | x0 = i] = c̄(i, µ(i)) + α Σ_j Pij(µ(i)) J(j),
or more compactly Tµ(J) = c̄µ + αPµ J. Recall also that Tµ is a contraction mapping with fixed point the value
function Jµ. The fixed policy Value Iteration method computes Jµ via successive approximations by performing the
recursion Jk+1 = Tµ(Jk) for some initial guess J0 of Jµ. By the contraction property of Tµ, Jk → Jµ as k → ∞.
Assume now that the underlying MDP model is unknown. We therefore require some form of learning of Jµ in this
case. Let Jˆ0 be an initial guess of Jµ , e.g., Jˆ0 = J0 . Suppose that we simulate the system using policy µ and at times k
and k + 1 we observe xk , c(xk , µ(xk )) and xk+1 (we also observe c(xk+1 , µ(xk+1 ))). Let Jˆk be an estimate of Jµ at
time k. Then, c(xk , µ(xk )) + αJˆk (xk+1 ) is clearly a noisy estimate of Tµ (Jk )(xk ) in the fixed policy Value Iteration
recursion, i.e., of Jk+1 (xk ). Because the aforementioned estimate is noisy, we can use it to update Jˆk (xk ) only by a
sufficiently small amount which guarantees convergence.
Idea: Stochastic Approximation Recursion similar to Q-learning, SARSA, etc. (this is the TD(0) algorithm):
Ĵk+1(xk) = (1 − εk(xk)) Ĵk(xk) + εk(xk) [c(xk, µ(xk)) + α Ĵk(xk+1)],
with Ĵk+1(i) = Ĵk(i) for i ≠ xk.
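A compact Python sketch of this TD(0) recursion for evaluating a fixed policy µ; the simulator `step` and the stepsize rule are placeholders.

```python
import numpy as np

def td0_policy_evaluation(step, mu, X, alpha, n_steps=100_000, seed=0):
    """TD(0): update only the visited state using the one-step bootstrapped target."""
    rng = np.random.default_rng(seed)
    J = np.zeros(X)
    N = np.zeros(X)
    x = int(rng.integers(X))
    for _ in range(n_steps):
        x_next, cost = step(x, mu[x])
        N[x] += 1
        eps = 1.0 / N[x]                     # eps_k(i) = eps / N_k(i) with eps = 1
        # temporal difference delta_k = c + alpha * J(x_{k+1}) - J(x_k)
        J[x] += eps * (cost + alpha * J[x_next] - J[x])
        x = x_next
    return J
```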
1. Using the previously mentioned stepsize sequence εk(i) = ε/Nk(i) for some ε > 0, where Nk(i) is the number of
visits to state i by time k, and assuming that all states are visited infinitely often, Ĵk → Jµ as k → ∞ with probability 1.
To elaborate a bit more on this point, we note that if TD(0) converges, then it must converge to some J̃ such that10 the
expected update is zero, i.e., E[c(xk, µ(xk)) + α J̃(xk+1) − J̃(xk) | xk = i] = 0 for every state i.
Equivalently, Tµ(J̃) − J̃ = 0 and since Tµ has a unique fixed point, we conclude that J̃ = Jµ. Moreover, with
the assumptions Σ_k εk(i) = ∞, Σ_k εk²(i) < ∞ and by the ODE method, TD(0) will follow closely trajectories
of the linear ODE:
J̇ = c̄µ + (αPµ − I)J.
The eigenvalues of αPµ − I lie in the open left half complex plane11 and hence, the ODE is globally asymptotically
stable. Therefore, by the ODE method for stochastic approximation, Ĵk → Jµ as k → ∞ with probability 1.
2. If εk = ε > 0 and each state is visited infinitely often, then “tracking” is achieved, i.e., Ĵk will be close to Jµ in
the long run. More precisely, for all accuracy and confidence parameters ε̃, δ > 0, the estimate eventually remains within δ of Jµ with probability at least 1 − ε̃, provided the constant stepsize ε is sufficiently small.
ℓ-step temporal differences: Denote by δk = c(xk, µ(xk)) + α Ĵk(xk+1) − Ĵk(xk) the temporal difference at time k appearing in the TD(0) update; see (10.53). Using ℓ consecutive temporal differences instead of a single one, we obtain:
Ĵk+1(xk) = Ĵk(xk) + εk(xk) Σ_{r=0}^{ℓ−1} α^r δ_{k+r}.  (10.54)
Here, Ĵk(·) is the value function estimate used in δ_{k+r} for any r.
TD(λ) Algorithm with 0 ≤ λ ≤ 1: Considering all future temporal differences with a geometric weighting of
decreasing importance, we obtain the following algorithm:
Ĵk+1(xk) = Ĵk(xk) + εk(xk) Σ_{r=0}^∞ (λα)^r δ_{k+r}.  (10.55)
Again, Ĵk(·) is the value function estimate used in δ_{k+r} for any r.
λ is called the trace-decay parameter and it controls the amount of bootstrapping: for λ = 0 we obtain the TD(0)
algorithm and for λ = 1 a Monte Carlo method (or the TD(1) method) for policy evaluation.
Convergence of TD(λ): Similar to TD(0) but often faster than TD(0) and Monte-Carlo methods if λ is judiciously
chosen. This has been experimentally verified, especially when function approximation is used for the value function.
In practice, good values of λ are determined by trial and error. Also, the value of λ can be modified even during the
algorithm without affecting convergence.
10 More generally, this condition should hold at least for all states that are sampled infinitely often.
11 λi(αPµ − I) = αλi(Pµ) − 1 and since α|λi(Pµ)| < 1, it turns out that Re{αλi(Pµ) − 1} < 0.
Appendix (ODE method for synchronous Q-learning): Consider the first equation in the synchronous Q-learning recursion (10.11). Compactly, this recursion can be written in
matrix form as follows:
Q̂k+1 = Q̂k + εk (h(Q̂k) + Mk).  (10.56)
Here, Q̂k = [Q̂k(x, u)] is the X × U matrix of Q-value estimates at time k, h(Q̂k) is the X × U matrix with entries
h(Q̂k)(x, u) = T(Q̂k)(x, u) − Q̂k(x, u), and Mk is a martingale difference (zero conditional mean) noise term capturing
the sampling of the next state. Then, the associated ODE for the last stochastic matrix recursion is:
Q̇ = h(Q) = T(Q) − Q.
Recall that T is a contraction mapping with parameter α. It turns out that this ODE has a unique globally stable fixed
point, the synchronous Q-learning algorithm is stable and the stochastic recursion (10.56) converges to the optimal Q,
which is the unique fixed point of T.