Lecture 10: Q-Learning, Function Approximation, Temporal Difference Learning
Previously, we have considered solving MDPs when the underlying model, i.e., the transition probabilities and the cost
structure, is completely known. If the MDP model is unknown, or if it is known but computationally infeasible
to use directly (except through sampling) due to the domain size, then simulation-based stochastic approximation
algorithms for estimating the optimal policy of MDPs can be employed. Here, “simulation-based” means that the
agent/controller can observe the system trajectory under any choice of control actions and therefore, trajectory sampling
from the MDP is performed. We will study two algorithms in this case:
(1) Q-learning, studied in this lecture: It is based on the Robbins–Monro algorithm (stochastic approximation (SA))
to estimate the value function for an unconstrained MDP. A primal-dual Q-learning algorithm can be employed
for MDPs with monotone optimal policies. The Q-learning algorithm also applies as a suboptimal method for
POMDPs.
(2) Policy gradient algorithms, which we will see in later lectures: Such algorithms rely on parametric policy
classes, e.g., on the class of Gibbs policies. They employ gradient estimation of the cost function together with
a stochastic gradient algorithm on the performance surface induced by the selected smoothly parameterized
policy class M = {µθ : θ ∈ Rd } of stochastic stationary policies to estimate the optimal policy. Policy gradient
algorithms apply to MDPs and constrained MDPs, while they yield suboptimal policy search methods for
POMDPs.
Note: Determining the optimal policy of an MDP (or a POMDP) when the model parameters are unknown corresponds
to a stochastic adaptive control problem. Stochastic adaptive control algorithms are of two types: direct methods,
where the unknown MDP model is estimated simultaneously with updating the control policy, and implicit methods
such as simulation-based methods, where the underlying MDP model is not directly estimated in order to compute the
control policy1 . Q-learning, Temporal Difference (TD) learning and policy gradient algorithms correspond to such
simulation-based methods. Such methods are also called reinforcement learning algorithms.
Reinforcement Learning: Also called neuro-dynamic programming or approximate dynamic programming2 . The first
term is due to the use of neural networks with RL algorithms. Reinforcement learning is a branch of machine learning.
It corresponds to learning how to map situations or states to actions or equivalently to learning how to control a system
in order to minimize or to maximize a numerical performance measure that expresses a long-term objective. The agent
is not told which actions to take, but instead must discover which actions yield the most reward or the least cost by
trying them. Actions may affect the immediate reward or cost and the next situation or state. Thus, actions influence all
subsequent rewards or costs. These two characteristics, i.e., the trial-and-error search and the delayed rewards or costs,
are the two most distinguishing features of reinforcement learning. The main differences of reinforcement learning
from other machine learning paradigms are summarized below:
1 Often in the literature, the terms “direct” and “implicit or indirect” learning are used with reverse associations to methods. With the term “direct
learning” several authors refer to simulation-based methods and the “directness” corresponds to “directly, without estimating an environmental
model”. Similarly, “indirect learning” is used for methods estimating first a model for the environment and then computing an optimal policy via
“certainty equivalence”. As an additional comment of independent interest, it is well-known in adaptive control theory that the certainty equivalence
principle may lead to suboptimal performance due to the lack of exploration.
2 Consider the very rich field known as approximate dynamic programming. Neuro-Dynamic Programming is mainly a theoretical treatment of the
field using the language of control theory. Reinforcement Learning describes the field from the perspective of artificial intelligence and computer
science. Finally, Approximate Dynamic Programming uses the parlance of operations research, with more emphasis on high dimensional problems
that typically arise in this community.
(a) There is no supervisor, only a reward or a cost signal which reinforces certain actions over others.
(b) The feedback is typically delayed.
(c) The data are sequential and therefore, time is critical.
(d) The actions of the agent affect the (subsequent) data generation mechanism.
Simulation-based RL algorithms can be grouped into the following two classes3:
(i) Policy-Iteration based algorithms or “actor-critic” learning: As an initial comment, actor-critic learning is
the (generalized) learning analogue of the Policy Iteration method of Dynamic Programming (DP), i.e., the
corresponding approach that is followed in the context of reinforcement learning due to the lack of knowledge
of the underlying MDP model and possibly due to the use of function approximation if the state-action space
is large. More specifically, actor-critic methods implement Generalized Policy Iteration. Recall that policy
iteration alternates between a complete policy evaluation and a complete policy improvement step. If sample-
based methods or function approximation for the underlying value function or the Q-function4 are employed,
exact evaluation of the policies may require infinitely many samples or might be impossible due to the function
approximation class. Hence, RL algorithms simulating policy iteration must change the policy by relying on
partial or incomplete knowledge of the associated value function. Such schemes are said to implement generalized
policy iteration.
To make the discussion more concrete, recall the Policy Iteration method. For an initial policy µ0 = µ the
following iteration is set:
• Policy Evaluation: Compute Jµk = (I − αPµk)⁻¹ c̄µk and let Qµk(i, u) = c̄(i, u) + α Σ_j Pij(u) Jµk(j).
• Policy Improvement: Update µk to the greedy policy associated with Qµk, i.e., to
  µk+1(i) = arg min_u [c̄(i, u) + α Σ_j Pij(u) Jµk(j)] = arg min_u Qµk(i, u) for all i ∈ X.
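To make the two steps concrete, the following minimal Python sketch implements exact policy iteration for a known finite MDP. The interface (a transition tensor `P`, an expected-cost matrix `c_bar`, a discount `alpha`) is an assumption made for the example, not something defined in these notes.

```python
import numpy as np

def policy_iteration(P, c_bar, alpha, max_iter=1000):
    """Exact policy iteration for a known finite MDP.

    P:     array of shape (U, X, X); P[u, i, j] = P_ij(u)
    c_bar: array of shape (X, U);    c_bar[i, u] = expected stage cost
    alpha: discount factor in (0, 1)
    """
    U, X, _ = P.shape
    mu = np.zeros(X, dtype=int)                    # initial policy mu_0
    for _ in range(max_iter):
        # Policy evaluation: J_mu = (I - alpha * P_mu)^{-1} c_bar_mu
        P_mu = P[mu, np.arange(X), :]              # row i is P_i.(mu(i))
        c_mu = c_bar[np.arange(X), mu]
        J_mu = np.linalg.solve(np.eye(X) - alpha * P_mu, c_mu)
        # Q_mu(i, u) = c_bar(i, u) + alpha * sum_j P_ij(u) J_mu(j)
        Q_mu = c_bar + alpha * np.einsum('uij,j->iu', P, J_mu)
        # Policy improvement: greedy policy with respect to Q_mu
        mu_next = Q_mu.argmin(axis=1)
        if np.array_equal(mu_next, mu):            # policy stable -> optimal
            return mu, J_mu, Q_mu
        mu = mu_next
    return mu, J_mu, Q_mu
```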
We now want to perform the above two steps without access to the true dynamics and reward or cost structure.
Assume that an initial policy µ0 = µ is chosen. The corresponding iteration implemented by an actor-critic
method is subdivided as follows:
• Policy Evaluation (“Critic”): Estimate the value of the current policy of the actor or Qµk by performing a
model-free policy evaluation. This is a value prediction problem.
• Policy Improvement (“Actor”): Update µk based on the estimated Qµk from the previous step.
A comment: As is clear from the previous description, the actor performs policy improvement. This
improvement can be implemented in a similar spirit as in policy iteration by moving the policy towards the
greedy policy underlying the Q-function estimate obtained from the critic. Alternatively, policy gradient can
be performed directly on the performance surface underlying a chosen parametric policy class. Moreover, the
actor performs some form of exploration to enrich the current policy, i.e., to guarantee that all actions are tried.
Roughly speaking, the exploration process guarantees that all state-action pairs are sampled sufficiently often.
Exploration of all actions available at a particular state is important, even if they might be suboptimal with respect
to the current Q-estimate. Moreover, the policy evaluation step produces an estimate of Qµk . Clearly, a point of
vital importance is that the policy improvement step monotonically improves the policy as in the model-based
case.
Methods for policy evaluation (“critic”) include:
• Monte Carlo policy evaluation.
• Temporal Difference methods: TD(λ), SARSA, etc.
3 These classes fall within the more general class of value-function based schemes.
4 or both sample-based methods with function approximation
(ii) Value-Iteration based algorithms: Such approaches are based on some online version of value iteration,
Ĵk+1(i) = min_u [c̄(i, u) + α Σ_j Pij(u) Ĵk(j)], ∀i ∈ X. The basic learning algorithm in this class is Q-learning.
The aim of Q-learning is to approximate the optimal action-value function Q by generating a sequence {Q̂k }k≥0
of such functions. The underlying idea is that if Q̂k is “close” to Q for some k, then the corresponding greedy
policy with respect to Q̂k will be close to the optimal policy µ∗ which is greedy with respect to Q.
The Q-learning algorithm is a widely used model-free reinforcement learning algorithm. It corresponds to the
Robbins–Monro stochastic approximation algorithm applied to estimate the value function of Bellman’s dynamic
programming equation. It was introduced in 1989 by Christopher J. C. H. Watkins in his PhD Thesis. A convergence
proof was presented by Christopher J. C. H. Watkins and Peter Dayan in 1992. A more detailed mathematical proof
was given by John Tsitsiklis in 1994, and by Dimitri Bertsekas and John Tsitsiklis in their book on Neuro-Dynamic
Programming in 1996. As a final comment, although Q-learning is a cornerstone of the RL field, it does not really scale
to large state-control spaces. Large state-control spaces associated with complex problems can be handled by using
state aggregation or approximation techniques for the Q-values.
Recall that Bellman’s dynamic programming equation for a discounted cost MDP is:
J*(i) = min_{u∈U} [c̄(i, u) + α Σ_j Pij(u) J*(j)],  i ∈ X.  (10.1)
Definition 10.1. (Q-function or state-action value function) The (optimal) Q-function is defined by
Q(i, u) = c̄(i, u) + α Σ_j Pij(u) J*(j),  i ∈ X, u ∈ U.  (10.2)
If the Q-function is known, then J*(i) = min_u Q(i, u) and the optimal policy is greedy with respect to Q, i.e.,
µ*(i) = arg min_u Q(i, u). Combining the previous results, we obtain:
Q(i, u) = c̄(i, u) + α Σ_j Pij(u) min_v Q(j, v)
        = c̄(i, u) + α E[min_v Q(xk+1, v) | xk = i, uk = u].  (10.3)
By abusing notation, we will denote by T the operator in the right-hand side of the previous equation. Then, (10.3) can
be alternatively written as:
Q = T (Q) , (10.4)
where
T(Q)(i, u) = c̄(i, u) + α Σ_j Pij(u) min_v Q(j, v)
           = E[c(xk, uk) + α min_v Q(xk+1, v) | xk = i, uk = u].  (10.5)
Theorem 10.2. T is a contraction mapping (with respect to ‖·‖∞) with parameter α < 1, i.e., ‖T(Q1) − T(Q2)‖∞ ≤ α ‖Q1 − Q2‖∞ for all Q1, Q2.
Proof.
(T(Q1) − T(Q2))(i, u) = α Σ_j Pij(u) [min_v Q1(j, v) − min_v Q2(j, v)].
It follows that
|(T(Q1) − T(Q2))(i, u)| ≤ α Σ_j Pij(u) |min_v Q1(j, v) − min_v Q2(j, v)|
                        ≤ α Σ_j Pij(u) max_v |Q1(j, v) − Q2(j, v)|
                        ≤ α Σ_j Pij(u) max_{r,v} |Q1(r, v) − Q2(r, v)|
                        = α max_{r,v} |Q1(r, v) − Q2(r, v)| Σ_j Pij(u)
                        = α ‖Q1 − Q2‖∞,
which leads to ‖T(Q1) − T(Q2)‖∞ ≤ α ‖Q1 − Q2‖∞ since the obtained bound is valid for any (i, u) ∈ X × U. We
further note that the second inequality is due to the fact that for two vectors x and y of equal dimension,
|min_i xi − min_i yi| ≤ max_i |xi − yi|.
Proof. Assume without loss of generality that min_i xi ≥ min_i yi and let i1 = arg min_i yi. Then,
0 ≤ min_i xi − min_i yi ≤ x_{i1} − y_{i1} ≤ max_i |xi − yi|, which proves the claim.
Note: For two vectors x and y of equal dimension, |max_i xi − max_i yi| ≤ max_i |xi − yi|. This inequality is useful in
the case of a problem with rewards. It can be used to show the contraction property of the corresponding T operator
defined as T(Q)(i, u) = r̄(i, u) + α Σ_j Pij(u) max_v Q(j, v) in this case.
Since T is a contraction mapping, we can use value iteration to compute Q(i, u) if the MDP model is known. In
practice, the MDP model is unknown. This problem is addressed via Q-learning.
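For illustration, here is a brief Python sketch of Q-value iteration, i.e., of repeatedly applying the operator T when the model is known. The arrays `P`, `c_bar` and the tolerance are assumptions made for the example.

```python
import numpy as np

def q_value_iteration(P, c_bar, alpha, tol=1e-8, max_iter=100_000):
    """Compute the optimal Q-function by iterating Q <- T(Q).

    P[u, i, j] = P_ij(u), c_bar[i, u] = expected stage cost, 0 < alpha < 1.
    """
    U, X, _ = P.shape
    Q = np.zeros((X, U))
    for _ in range(max_iter):
        # T(Q)(i, u) = c_bar(i, u) + alpha * sum_j P_ij(u) * min_v Q(j, v)
        TQ = c_bar + alpha * np.einsum('uij,j->iu', P, Q.min(axis=1))
        if np.max(np.abs(TQ - Q)) < tol:   # sup-norm stopping rule
            return TQ
        Q = TQ
    return Q
```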
10.1.1 Q-learning
By now, it should be clear that Q-learning means learning the Q-function. In the following, we introduce two versions
of Q-learning: synchronous and asynchronous. Before this, we examine the rationale behind Q-learning and its
connection with the fundamental Robbins-Monro algorithm.
Observe that (10.1) has an expectation inside the minimization, while (10.3) has an expectation outside the minimization.
This crucial observation forms the basis for using stochastic approximation algorithms to estimate the Q-function. Note
that (10.3) can be written as:
E[h(Q)(xk, uk, xk+1) | xk = i, uk = u] = 0,  ∀(i, u) ∈ X × U,  (10.6)
where
h(Q)(xk, uk, xk+1) = c(xk, uk) + α min_v Q(xk+1, v) − Q(xk, uk).  (10.7)
The Robbins-Monro algorithm can be used to estimate the solution of (10.6) via the recursion:
Q̂k+1(xk, uk) = Q̂k(xk, uk) + εk(xk, uk) h(Q̂k)(xk, uk, xk+1),  (10.8)
or
Q̂k+1(xk, uk) = (1 − εk(xk, uk)) Q̂k(xk, uk) + εk(xk, uk) [c(xk, uk) + α min_v Q̂k(xk+1, v)].  (10.9)
Note: As we will see later on, the term h(Q)(xk , uk , xk+1 ) is very similar to the temporal difference in TD schemes
for policy evaluation (specifically, to TD(0)), except for the minimization operation applied to Q̂(xk+1 , v). Defining the
temporal difference to incorporate the minimization (or maximization for rewards) operator, Q-learning corresponds to
an instance of temporal difference learning.
Remark: It turns out that the temporal differences underlying Q-learning do not telescope. This is due to the fact
that Q-learning is an inherently off-policy algorithm. We clarify the notion of off-policy algorithms further after the
deterministic Q-learning example that we provide in the following.
The decreasing stepsize sequences (or sequences of learning rates) {εk(i, u)}k≥0 for any (i, u) ∈ X × U must satisfy
Σ_k εk(i, u) = ∞ and Σ_k εk²(i, u) < ∞ in the context of stochastic approximation. These constraints are also called
Robbins-Monro conditions. A possible choice is
εk(i, u) = ε / Nk(i, u),  (10.10)
where ε > 0 is a constant and Nk(i, u) is the number of times the state-action pair (i, u) has been visited until time k
by the algorithm. The algorithm can be summarized as a two-timescale stochastic approximation scheme.
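A minimal Python sketch of such a two-timescale implementation follows. The simulator interface `step(x, u)` (returning the next state and the incurred cost), the update interval `T_bar` and the exploration probability are assumptions made here for concreteness.

```python
import numpy as np

def two_timescale_q_learning(step, X, U, alpha, T_bar=200, n_policy_updates=50,
                             explore=0.1, seed=0):
    """Two-timescale Q-learning sketch: fast Q-updates, slow policy updates."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((X, U))
    N = np.zeros((X, U))                      # visit counts for the stepsizes
    mu = np.zeros(X, dtype=int)               # current policy (slow time scale)
    x = int(rng.integers(X))
    for _ in range(n_policy_updates):         # slow time scale: policy update
        for _ in range(T_bar):                # fast time scale: Q-function update
            u = mu[x] if rng.random() > explore else int(rng.integers(U))
            x_next, cost = step(x, u)         # sample next state and observe cost
            N[x, u] += 1
            eps = 1.0 / N[x, u]               # stepsize eps_k(i, u), cf. (10.10)
            Q[x, u] += eps * (cost + alpha * Q[x_next].min() - Q[x, u])
            x = x_next
        mu = Q.argmin(axis=1)                 # greedy policy w.r.t. the current Q
    return Q, mu
```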
This two-time scale implementation of Q-learning with policy updates during an infinitely long trajectory appears in
the literature, but it is not the standard form of this algorithm.
Remarks:
1. Fast Time Scale: The Q-function is updated by applying the same policy for a fixed period T̄ of time slots referred
to as the update interval.
2. Slow Time Scale: Policy update.
3. Simulate (or sample) the next state as xk+1 ∼ Pxk,·(uk): The Q-learning algorithm does not require explicit
knowledge of the transition probabilities, but simply access to the controlled system to measure its next state
xk+1 when an action uk is applied. Via the same access to the controlled system, the cost signal c(xk , uk ) is
observed.
4. Under some conditions, the Q-learning algorithm for a finite state MDP converges almost surely to the optimal
solution of Bellman’s equation.
Note: The above implementation explicitly describes how the observed trajectory is generated. Clearly, the
updated policies participate in forming this trajectory.
In the literature, instead of the previous two-time scale form which incorporates policy update steps by considering the
greedy policies with respect to the underlying Q-function estimates at the end of update intervals, Q-learning is usually
presented in the following two variations, where the generation of the underlying system trajectory is not explicitly
specified:
Synchronous Q-learning: At time k + 1, we update the Q-function as
Q̂k+1(i, u) = Q̂k(i, u) + εk [c(i, u) + α min_v Q̂k(j, v) − Q̂k(i, u)]
           = (1 − εk) Q̂k(i, u) + εk [c(i, u) + α min_v Q̂k(j, v)],  ∀(i, u) ∈ X × U.  (10.11)
This version is known as synchronous Q-learning because the update is performed for all state-action pairs (i, u) per
iteration. Here, xk+1 = j when uk = u is applied to state xk = i. In other words, at stage k, a random successor state
xk+1 = xk+1(i, u) is simulated according to the distribution Pi·(u) to implement the update of the Q-value for each
state-action pair (i, u).
Asynchronous Q-learning: Suppose that xk = i. With probability pk we choose uk = arg minu Q̂k (xk , u) and with
probability (1 − pk ) we choose any action uniformly at random. This ensures that all state-action pairs (i, u) are
explored instead of just exploiting the current knowledge of Q̂k . At time k + 1, we update the Q-function as
Q̂k+1(i, u) = (1 − εk(i, u)) Q̂k(i, u) + εk(i, u) [c(i, u) + α min_v Q̂k(j, v)],
where u = uk and j = xk+1, while the remaining entries of Q̂k are left unchanged.
In practice, asynchronous Q-learning is used. Moreover, the term “Q-learning” is reserved almost exclusively for
asynchronous Q-learning.
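A minimal Python sketch of asynchronous Q-learning along a single trajectory is given below; the exploration probability and the assumed simulator `step(x, u)` (returning next state and cost) are placeholders for the example.

```python
import numpy as np

def asynchronous_q_learning(step, X, U, alpha, n_steps=100_000, p_greedy=0.9, seed=0):
    """Asynchronous Q-learning: only the visited pair (x_k, u_k) is updated."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((X, U))
    N = np.zeros((X, U))
    x = int(rng.integers(X))
    for _ in range(n_steps):
        # with probability p_greedy exploit, otherwise explore uniformly at random
        u = int(Q[x].argmin()) if rng.random() < p_greedy else int(rng.integers(U))
        x_next, cost = step(x, u)
        N[x, u] += 1
        eps = 1.0 / N[x, u]                          # Robbins-Monro stepsize
        Q[x, u] = (1 - eps) * Q[x, u] + eps * (cost + alpha * Q[x_next].min())
        x = x_next
    return Q, Q.argmin(axis=1)                       # Q-estimate and greedy policy
```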
Stochastic Approximation Rationale in relevance to (10.4): We want to solve Q = T (Q). Borrowing the idea from
stochastic approximation, we can use the iteration
r̂k+1 = (1 − εk )r̂k + εk (T (r̂k ) + noise),
to solve an equation of the form r = T (r) or T (r) − r = 0.
Policy Selection: For the described synchronous and asynchronous Q-learning schemes, either we (theoretically) wait
until convergence and choose the greedy policy with respect to the limit, or we stop the iteration earlier, at
some Q̂k, and extract the corresponding greedy policy as our control choice.
Deterministic Q-learning example: Consider a deterministic MDP with dynamics and per-stage costs
xk+1 = f(xk, uk),
ck = c(xk, uk).
Let the performance metric for some policy µ be the discounted cost objective:
Jµ(i) = Σ_{k=0}^∞ α^k ck,  x0 = i,  uk = µ(xk).
Then, J*(i) = min_µ Jµ(i), ∀i ∈ X. Moreover, the Q-function is defined in this case as
Q(i, u) = c(i, u) + α J*(f(i, u)).
Clearly, J*(i) = min_u Q(i, u). Moreover, the above definition can be given only in terms of the Q-function as follows:
Q(i, u) = c(i, u) + α min_v Q(f(i, u), v).
The corresponding (deterministic) Q-learning algorithm updates, along an observed trajectory, only the entry of the currently visited state-action pair via
Q̂k+1(xk, uk) = c(xk, uk) + α min_v Q̂k(xk+1, v),
leaving all other entries of Q̂k unchanged.
Note: This algorithm relies on a particular system trajectory. However, no reference on how to choose the actions is
made.
Theorem 10.4. Consider the previous deterministic MDP model and the described Q-learning algorithm for this
scenario. If each state-action pair is visited infinitely often, then Q̂k(i, u) → Q(i, u) as k → ∞ for every state-action pair
(i, u) ∈ X × U.
Proof. Let ∆k = ‖Q̂k − Q‖∞ = max_{i,u} |Q̂k(i, u) − Q(i, u)|. Then, at every time instant k + 1, for the updated pair (i, u) = (xk, uk) we have:
|Q̂k+1(i, u) − Q(i, u)| = |ck + α min_v Q̂k(xk+1, v) − (ck + α min_v Q(xk+1, v))|
                        = α |min_v Q̂k(xk+1, v) − min_v Q(xk+1, v)| ≤ α max_v |Q̂k(xk+1, v) − Q(xk+1, v)| ≤ α ∆k.
Consider now any time interval {k1, k1 + 1, . . . , k2} in which each state-action pair is visited at least once. Then, by
the previous derivation (applied to every pair at the time of its last update in the interval, and noting that the error of
an entry cannot grow between its updates),
∆k2 ≤ α ∆k1.
Since as k → ∞ there are infinitely many such intervals and α < 1, we conclude that ∆k → 0 as k → ∞.
Remark: Guaranteeing that each state-action pair is visited infinitely often ensures that as k → ∞ there are infinitely
many time intervals {k1, k1 + 1, . . . , k2} such that each state-action pair is visited at least once. To ensure this
requirement, some form of exploration is often used. The speed of convergence may depend critically on the efficiency
of the exploration.
Key Aspect: As mentioned before, there is no reference to the policy used to generate the system trajectory, i.e., an
arbitrary policy may be used for this purpose. For this reason, Q-learning is an off-policy algorithm. An alternative
way to explain the term “off-policy learning” is learning a policy by following another policy. A distinguishing
characteristic of off-policy algorithms like Q-learning is the fact that the update rule need not have any relation to the
underlying learning policy generating the system trajectory. The updates of Q̂k (xk , uk ) in the Deterministic Q-learning
algorithm and in (10.8)-(10.9) depend on minv Q̂k (xk+1 , v), i.e., the estimated Q-functions are updated on the basis
of hypothetical actions or more precisely actions other than those actually executed. On the other hand, on-policy
algorithms learn the underlying policy as well. Moreover, on-policy algorithms update value functions strictly on the
basis of the experience gained from executing some (possibly nonstationary) policy.
Q-learning is a fairly simple algorithm to implement. Additionally, it permits the use of arbitrary policies to generate
the training data provided that in the limit, all state-action pairs are visited and therefore are updated infinitely often.
Action sampling strategies to achieve this requirement in closed-loop learning are ε-greedy action selection schemes or
Boltzmann exploration that we provide in the sequel5 . More generally, any persistent exploration method will ensure
the aforementioned target requirement. With appropriate tuning, asymptotic consistency can be achieved.
The Q-learning algorithm can be viewed as a stochastic process to which techniques of stochastic approximation
can be applied. The following proof of convergence relies on an extension of Dvoretzky’s (1956) formulation of the
classical Robbins-Monro (1951) stochastic approximation theory to obtain a class of converging processes involving
the maximum norm.
Theorem 10.5. A random iterative process ∆n+1(x) = (1 − an(x))∆n(x) + bn(x)Fn(x) converges to zero with
probability 1 under the following assumptions:
1. The state space is finite.
2. Σ_n an(x) = ∞, Σ_n an²(x) < ∞, Σ_n bn(x) = ∞, Σ_n bn²(x) < ∞, and E[bn(x) | Fn] ≤ E[an(x) | Fn] uniformly with probability 1.
3. ‖E[Fn(x) | Fn]‖_w ≤ γ ‖∆n‖_w, with γ ∈ (0, 1).
4. Var(Fn(x) | Fn) ≤ C(1 + ‖∆n‖_w)², for some constant C > 0.
Here, Fn = {∆n, ∆n−1, . . . , Fn−1, . . . , an−1, . . . , bn−1, . . .} stands for the history at step n. Fn(x), an(x) and bn(x)
are allowed to depend on the past insofar as the above conditions remain valid. Finally, ‖·‖_w denotes some weighted
maximum norm.
5 We have seen both ε-greedy schemes and Boltzmann exploration in the context of multi-armed bandit problems in a previous lecture file.
Weighted Maximum Norm: Let w = [w1, . . . , wn]^T ∈ R^n be a positive vector. Then, for a vector x ∈ R^n the
following weighted max-norm based on w can be defined:
‖x‖_w = max_{1≤i≤n} |xi| / wi.  (10.12)
This norm induces a matrix norm. For a matrix A ∈ R^{n×n} the following weighted max-norm based on w can be
defined:
‖A‖_w = max {‖Ax‖_w : x ∈ R^n, ‖x‖_w = 1}.  (10.13)
Theorem 10.6. (Convergence of Q-learning) For a finite MDP, the Q-learning recursion
Q̂k+1(xk, uk) = (1 − εk(xk, uk)) Q̂k(xk, uk) + εk(xk, uk) [c(xk, uk) + α min_v Q̂k(xk+1, v)]  (10.14)
converges with probability 1 to the optimal Q-function, i.e., the one with values given by (10.2), as long as
Σ_k εk(x, u) = ∞ and Σ_k εk²(x, u) < ∞, ∀(x, u) ∈ X × U.  (10.15)
Note: Under the usual consideration that εk (xk , uk ) ∈ [0, 1), (10.15) requires that all state-action pairs are visited
infinitely often.
Proof. Let
∆k(x, u) = Q̂k(x, u) − Q(x, u),
where Q(x, u) is given by (10.2). Subtracting Q(xk, uk) from both sides of (10.14), we obtain:
∆k+1(xk, uk) = (1 − εk(xk, uk)) ∆k(xk, uk) + εk(xk, uk) [c(xk, uk) + α min_v Q̂k(xk+1, v) − Q(xk, uk)],  (10.16)
where the bracketed term is Fk(xk, uk), by defining
Fk(x, u) = c(x, u) + α min_v Q̂k(X(x, u), v) − Q(x, u),
where X(x, u) is a random sample state obtained from the Markov chain with state space X and transition matrix
P(u) = [Pij(u)]. Therefore, the Q-learning algorithm has the form of the process in the previous theorem with
an(x) = bn(x) = εn(x, u). It is now easy to see that
E[Fk(x, u) | Fk] = T(Q̂k)(x, u) − Q(x, u),
where T is given by (10.5). Using the fact that T is a contraction mapping and Q = T(Q), we further have
E[Fk(x, u) | Fk] = T(Q̂k)(x, u) − T(Q)(x, u). Therefore,
‖E[Fk(x, u) | Fk]‖∞ = ‖T(Q̂k) − T(Q)‖∞ ≤ α ‖Q̂k − Q‖∞ = α ‖∆k‖∞.
Finally,
Var(Fk(x, u) | Fk) = E[(Fk(x, u) − E[Fk(x, u) | Fk])² | Fk]
                  = E[(c(x, u) + α min_v Q̂k(X(x, u), v) − T(Q̂k)(x, u))² | Fk]
                  = Var(c(x, u) + α min_v Q̂k(X(x, u), v) | Fk) ≤ C(1 + ‖∆k‖∞)²,
where the last step is due to the (at most) linear dependence of the argument on Q̂k(X(x, u), v) and the underlying
assumption that the variance of c(x, u) is bounded. Hence, the conditions of Theorem 10.5 are satisfied (with γ = α and
the maximum norm), and ∆k converges to zero with probability 1, i.e., Q̂k → Q.
Asymptotic Convergence Rate of Q-learning: For discounted MDPs with discount factor 1/2 < α < 1, the asymptotic
rate of convergence of Q-learning is O(1/k^{δ(1−α)}) if δ(1 − α) < 1/2 and O(√(log log k / k)) if δ(1 − α) ≥ 1/2, provided that
the state-action pairs are sampled from a fixed probability distribution. Here, δ = p_min/p_max is the ratio of the minimum and
maximum state-action occupation frequencies.
Remark: In the context of function approximation discussed in subsequent sections, we refer to the so-called ODE
method to show convergence of stochastic recursions. At the end of this file, we provide a brief appendix with a
comment on explicitly looking into Q-learning convergence via the ODE method. For simplicity, we focus there on
the convergence of the synchronous Q-learning algorithm. A similar comment can be made for the asynchronous
Q-learning algorithm as well.
Boltzmann exploration chooses the next action according to the following probability distribution:
p(uk = u | xk = x) = exp(−Q̂k(x, u)/T) / Σ_v exp(−Q̂k(x, v)/T).  (10.17)
T is called temperature. In the literature, Boltzmann exploration is also known as softmax approximation (when rewards
are employed instead of costs and the sign in the exponents is plus instead of minus). In statistical mechanics, the
softmax function is known as the Boltzmann or Gibbs distribution. In general, the softmax function has the following
form,
σ(y; β)_i = exp(βy_i) / Σ_{j=1}^n exp(βy_j),  ∀i = 1, . . . , n,  y = (y_1, . . . , y_n) ∈ R^n,  β > 0.  (10.18)
It is easy to check that the softmax function defines a probability mass function (pmf) over the index set {1, ..., n}.
For β = 0 (or T = ∞), all indices are assigned an equal mass, i.e., the underlying pmf is uniform. As β increases,
indices corresponding to larger elements in y are assigned higher probabilities. In this sense, “softmax” is a smooth
approximation of the “max” function. When β → +∞, we have that
lim_{β→+∞} σ(y; β)_i = lim_{β→+∞} exp(βy_i) / Σ_{j=1}^n exp(βy_j) = 1{i = arg max_j y_j}.
In other words, the softmax function assigns all the mass to the maximum element of y as β → +∞. We note here that
this holds if the maximum element of y is unique, otherwise all the mass is assigned (uniformly) to the set of maxima
of y.
In the form of (10.17), uk tends to the greedy choice arg min_u Q̂k(xk, u) as T → 0, i.e., the policy becomes greedy as
T → 0. (In the form of (10.18), lim_{β→−∞} σ(y; β)_i = 1{i = arg min_j y_j}, if the minimum element of y is unique;
otherwise all the mass is assigned uniformly to the set of minima of y.)
In practice, depending on the learning goal, we may choose a temperature schedule such that Tk → 0 as k → ∞, which
guarantees that arg minv Q̂k (xk , v) is used as control in the limit k → ∞. This corresponds to a decaying exploration
scheme. We can also choose T to have a constant value. Constant temperature Boltzmann exploration corresponds to a
persistent exploration scheme. We discuss decaying and persistent exploration shortly.
In ε-greedy exploration, uk = arg min_v Q̂k(xk, v) is used with probability 1 − ε, and with probability ε an action
chosen uniformly at random is taken. This corresponds to a persistent exploration scheme. The value of ε can be reduced over
time, gradually moving the emphasis from exploration to exploitation. The last scheme is a paradigm of decaying
exploration.
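The two exploration rules can be summarized by the following small sketch; the temperature and ε values used in the usage lines are placeholder choices.

```python
import numpy as np

def epsilon_greedy_action(Q_row, eps, rng):
    """Pick the arg min of the Q-values with prob. 1 - eps, a uniform action otherwise."""
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))
    return int(np.argmin(Q_row))

def boltzmann_action(Q_row, T, rng):
    """Sample an action from the Boltzmann/softmax distribution (minus sign: costs)."""
    z = -np.asarray(Q_row, dtype=float) / T
    z -= z.max()                          # numerical stabilization
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(Q_row), p=p))

rng = np.random.default_rng(0)
Q_row = np.array([1.0, 0.2, 0.7])         # Q-values of the actions at some state
print(epsilon_greedy_action(Q_row, eps=0.1, rng=rng))
print(boltzmann_action(Q_row, T=0.5, rng=rng))
```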
Remarks:
1. Although the above exploration methods are often used in practice, they are local schemes, i.e., they rely on the
current state and therefore convergence is often slow.
2. More broadly, there are schemes with a more global view of the exploration process. Such schemes are designed
to explore “interesting” parts of the state space. We will not pursue such schemes here.
Decaying and Persistent Exploration: We now clarify the difference between decaying exploration and persistent
exploration schemes that exist in the literature. Decaying exploration schemes become over time greedy for choosing
actions, while persistent exploration schemes do not. The advantage of decaying exploration is that the actions taken by
the system may converge to the optimal ones eventually, but with the price that their ability to adapt slows down. On
the contrary, persistent exploration methods can retain their adaptivity forever, but with the price that the actions of the
system will not converge to optimality in the standard sense.
A class of decaying exploration schemes that is of interest in the sequel is defined as follows:
Definition 10.7. (Greedy in the Limit with Infinite Exploration) (GLIE): Consider exploration schemes satisfying
the following properties:
• Each action is visited infinitely often in every state that is visited infinitely often.
• In the limit, the chosen actions are greedy with respect to the learned Q-function with probability 1.
The first condition requires that exploration is performed indefinitely. The second condition requires that the exploration
is decaying over time and the emphasis is gradually passed to exploitation.
ε-greedy GLIE exploration: Let xk = i. Pick the corresponding greedy action with probability 1 − εk(i) and an
action at random with probability εk(i). Here, εk(i) = ν/Nk(i), 0 < ν < 1, and Nk(i) is the number of visits to the
current state xk = i by time k.
Boltzmann GLIE exploration: Let xk = x. Choose uk = u with probability
p(uk = u | xk = x) = exp(−βk(x) Q̂k(x, u)) / Σ_v exp(−βk(x) Q̂k(x, v)),
where
βk(x) = log Nk(x) / Ck(x)  and  Ck(x) ≥ max_{v,v′} |Q̂k(x, v) − Q̂k(x, v′)|.
10.4 SARSA
As we already mentioned, in Q-learning, the policy used to estimate Q is irrelevant as long as the state-action space is
adequately explored. In particular, to obtain Q̂k+1 (xk , uk ) we use minv Q̂k (xk+1 , v) in the Q-update instead of the
actual control used in the next step. SARSA (State–Action–Reward–State–Action) is another scheme for updating the
Q-function. The corresponding update in this case is:
Q̂k+1(xk, uk) = (1 − εk(xk, uk)) Q̂k(xk, uk) + εk(xk, uk) [c(xk, uk) + α Q̂k(xk+1, uk+1)]  (10.20)
and Q̂k+1(x, u) = Q̂k(x, u) for all (x, u) ≠ (xk, uk). Note that Q̂k(xk+1, uk+1) replaces min_v Q̂k(xk+1, v) in (10.9).
This scheme also converges if the greedy policy uk+1 = arg minv Q̂k+1 (xk+1 , v) is chosen as k → ∞. In this case,
due to this convergence,
lim_{k→∞} min_v Q̂k+1(xk+1, v) = lim_{k→∞} min_v Q̂k(xk+1, v)
for any fixed state xk+1 = j and therefore, SARSA and Q-learning iterations will coincide in the limit.
Note: SARSA and Q-learning iterations coincide if a greedy learning policy is applied (i.e., always the greedy action
with respect to the current Q-estimate is chosen).
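For comparison with the Q-learning sketch given earlier, here is a minimal SARSA implementation of the on-policy update (10.20); the simulator `step(x, u)` and the ε-greedy behaviour policy are assumptions of the example.

```python
import numpy as np

def sarsa(step, X, U, alpha, n_steps=100_000, eps_explore=0.1, seed=0):
    """SARSA: the update uses the action u_{k+1} actually taken at x_{k+1}."""
    rng = np.random.default_rng(seed)

    def behaviour(Q_row):
        if rng.random() < eps_explore:
            return int(rng.integers(U))
        return int(np.argmin(Q_row))

    Q = np.zeros((X, U))
    N = np.zeros((X, U))
    x = int(rng.integers(X))
    u = behaviour(Q[x])
    for _ in range(n_steps):
        x_next, cost = step(x, u)
        u_next = behaviour(Q[x_next])               # action actually taken next
        N[x, u] += 1
        eps = 1.0 / N[x, u]
        # cf. (10.20): the target uses Q(x_{k+1}, u_{k+1}), not min_v Q(x_{k+1}, v)
        Q[x, u] = (1 - eps) * Q[x, u] + eps * (cost + alpha * Q[x_next, u_next])
        x, u = x_next, u_next
    return Q
```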
Remarks:
1. SARSA is an on-policy scheme because in the update (10.20) the action uk+1 that was taken according to
the underlying policy is used. Compare this update with the off-policy Q-learning update in (10.9), where the
Q-values are updated using the greedy action corresponding to minv Q̂k (xk+1 , v).
2. As we will see later on, when the underlying policy µ is fixed, SARSA is equivalent to TD(0) applied to
state-action pairs.
3. The rate of convergence of SARSA coincides with the rate of convergence of TD(0) and is O(1/√k). This
is due to the fact that these algorithms are standard linear stochastic approximation methods. Nevertheless,
the corresponding constant in this rate of convergence will be heavily influenced by the choice of the stepsize
sequence, the underlying MDP model and the discount factor α.
4. SARSA can be extended to a multi-step version known as SARSA(λ). The notion of a multi-step extension will
be clarified later on, in our discussion on TD learning and specifically on the TD(λ) algorithm. We will not
pursue this concept any further here.
Convergence of SARSA: As mentioned earlier, for off-policy methods like Q-learning, the only requirement for
convergence is that each state-action pair is visited infinitely often. For on-policy schemes like SARSA which learn the
Q-function of the underlying learning policy, exploration has to eventually become small to ensure convergence to the
optimal policy6 .
Theorem 10.8. In finite state-action MDPs, the SARSA estimate Q̂k converges to the optimal Q-function and the
underlying learning policy µk converges to an optimal policy µ∗ with probability 1 if the exploration scheme (or
equivalently the learning policy) is GLIE and the following additional conditions hold:
1. The learning rates satisfy 0 ≤ εk(i, u) ≤ 1, Σ_k εk(i, u) = ∞, Σ_k εk²(i, u) < ∞ and εk(i, u) = 0 unless
(xk, uk) = (i, u).
2. Var(c(i, u)) < ∞ (or Var(r(i, u)) < ∞ for a problem with rewards).
3. The controlled Markov chain is communicating: every state can be reached from any other with positive
probability (under some policy).
For reference, recall the following classification of controlled Markov chains:
• Recurrent or Ergodic: The Markov chain corresponding to every deterministic stationary policy consists of a
single recurrent class.
• Unichain: The Markov chain corresponding to every deterministic stationary policy consists of a single recurrent
class plus a possibly empty set of transient states.
• Communicating: For every pair of states (i, j) ∈ X × X, there exists a deterministic stationary policy under
which j is accessible from i.
10.5 Q-approximation
Consider finite state-action spaces. If the size of the table representation of the Q-function is very large, then the
Q-function can be appropriately approximated to cope with this problem. Function approximation can be also applied
in the case of infinite state-action spaces.
Generic (Value) Function Approximation: To demonstrate some ideas, we first consider function approximation
for a generic function J : X → R. Let J = {Jθ : θ ∈ R^K} be a family of real-valued functions on the state space
X. Suppose that any function in J is a linear combination of a set of K fixed linearly independent (basis) functions
φr : X → R, r = 1, 2, . . . , K. More explicitly, for θ ∈ R^K,
Jθ(x) = Σ_{r=1}^K θr φr(x) = φ^T(x) θ.
A usual assumption is that ‖φ(x)‖₂ ≤ 1 uniformly on X, which can be achieved by normalizing the basis functions.
Compactly,
Jθ = Φθ,
6 For example, SARSA is often implemented in practice with ε-greedy exploration for a diminishing ε (ε-greedy GLIE exploration).
where Jθ = [Jθ(1), . . . , Jθ(X)]^T and Φ is the X × K matrix whose x-th row is the (transposed) feature vector φ^T(x).
The components φr(x) of the vector φ(x) are called features of state x and Φ is called a feature extraction matrix. Φ has to be
judiciously chosen based on our knowledge about the problem.
Feature Selection: In general, features can be chosen in many ways. Suppose that X ⊂ R. We can then use a
polynomial, Fourier or wavelet basis up to some order. For example, a polynomial basis corresponds to choosing
φ(x) = [1, x, x2 , . . . , xK−1 ]T . Clearly, the set of monomials {1, x, x2 , . . . , xK−1 } is a linearly independent set and
forms a basis for the vector space of all polynomials with degree ≤ K − 1. A different option is to choose an
orthogonal system of polynomials, e.g., Hermite, Laguerre or Jacobi polynomials, including the important special cases
of Chebyshev (or Tchebyshev) polynomials and Legendre polynomials.
If X is multi-dimensional, then the tensor product construction is a commonly used way to construct features. To
elaborate, let X ⊂ X1 × X2 × · · · × Xn and φi : Xi → R^{ki}, i = 1, 2, . . . , n. Then, the tensor product φ = φ1 ⊗ φ2 ⊗
· · · ⊗ φn will have ∏_{i=1}^n ki components for a given state vector x, which can be indexed using multi-indices of the form
(i1, i2, . . . , in), 1 ≤ ir ≤ kr, r = 1, 2, . . . , n. With this notation, φ(i1,i2,...,in)(x) = φ1,i1(x1) φ2,i2(x2) · · · φn,in(xn). If
X ⊂ Rn , then a popular choice for {φ1 , φ2 , . . . , φn } are the Radial Basis Function (RBF) networks
φi(xi) = [G(|xi − xi^(1)|), . . . , G(|xi − xi^(ki)|)]^T,  xi ∈ R, with centers xi^(r), r = 1, 2, . . . , ki,
where G is some user-determined function, often a Gaussian function of the form G(x) = exp(−γx²) for some
parameter γ > 0. Moreover, kernel smoothing is also an option for some appropriate choice of points x^(i), i =
1, 2, . . . , K:
Jθ(x) = Σ_{i=1}^K θi G̃^(i)(x),  where  G̃^(i)(x) = G(‖x − x^(i)‖) / Σ_{j=1}^K G(‖x − x^(j)‖).
Here, G̃^(i)(x) ≥ 0, ∀x ∈ X and ∀i ∈ {1, 2, . . . , K}. Due to Σ_{i=1}^K G̃^(i)(x) = 1 for any x ∈ X, Jθ is an example
of an averager. Averager function approximators are non-expansive mappings, i.e., θ → Jθ is non-expansive in the
maximum norm. Therefore, such approximators are suitable for use in reinforcement learning, because they can be
nicely combined with the contraction mappings in this framework.
The previous brief reference to feature selection methods clearly does not exhaust the list of such methods. In general,
many methods for this purpose are available. The interested reader is referred to the relevant literature.
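A small sketch of two of the feature constructions mentioned above (a one-dimensional polynomial basis and a Gaussian RBF vector); the centers and the parameter γ are arbitrary choices for illustration.

```python
import numpy as np

def polynomial_features(x, K):
    """phi(x) = [1, x, x^2, ..., x^(K-1)] for scalar x."""
    return np.array([x ** r for r in range(K)], dtype=float)

def rbf_features(x, centers, gamma=1.0):
    """phi(x) = [G(|x - x^(1)|), ..., G(|x - x^(k)|)] with G(d) = exp(-gamma * d^2)."""
    centers = np.asarray(centers, dtype=float)
    return np.exp(-gamma * (x - centers) ** 2)

# J_theta(x) = phi(x)^T theta for either feature map, e.g.:
theta = np.array([0.5, -1.0, 0.25])
print(polynomial_features(2.0, K=3) @ theta)
print(rbf_features(2.0, centers=[0.0, 1.0, 2.0]) @ theta)
```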
Q-function approximation: In a similar spirit as before, consider basis functions of the form φr : X × U → R,
r = 1, 2, . . . , K. Now φr (x, u), r = 1, 2, . . . , K correspond to the features of a given state-action pair (x, u). For
θ ∈ R^K,
Qθ(x, u) = Σ_{r=1}^K θr φr(x, u) = φ^T(x, u) θ.
Compactly,
Qθ = Φθ,
where Qθ = [Qθ(1, 1), . . . , Qθ(X, U)]^T and Φ is the XU × K matrix whose rows are the (transposed) feature vectors φ^T(x, u).
Under the described function approximation approach, the problem of learning an optimal policy can be formalized in
terms of learning θ. If K ≪ |X × U| = XU, then we have to learn a much smaller vector θ than the Q-vector.
Consider first the following approximate Q-Value Iteration algorithm for a known MDP model. Starting from some θ̂0, at each iteration t:
• Compute the target Q̃t+1 = T(Φθ̂t), i.e., Q̃t+1(x, u) = c̄(x, u) + α Σ_j Pxj(u) min_v φ^T(j, v)θ̂t for all (x, u) ∈ X × U.
• Solve the weighted least squares problem
min_θ (1/2) Σ_{(x,u)} w(x, u) (φ^T(x, u)θ − Q̃t+1(x, u))² = min_θ (1/2) ‖Φθ − Q̃t+1‖²_W
and set θ̂t+1 to the minimizer, where w(x, u) are some positive weights and W = diag(w(1, 1), . . . , w(X, U)). We note here that ‖x‖²_W =
x^T W x and W is a symmetric positive definite matrix. Moreover, ‖x‖_W = ‖W^{1/2} x‖₂, where W^{1/2} is the square
root of W and ‖·‖₂ is the ℓ2-norm7.
• Repeat.
Compact Version: More compactly, the above algorithm can be written as:
θ̂t+1 = arg min_θ (1/2) ‖Φθ − T(Φθ̂t)‖²_W,  t = 0, 1, 2, . . .  (10.21)
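In matrix form, each iteration of (10.21) is a weighted least squares problem with the closed-form solution θ̂t+1 = (Φ^T W Φ)⁻¹ Φ^T W T(Φθ̂t). A minimal sketch, with the feature matrix, the known-model operator and the weights as assumed inputs:

```python
import numpy as np

def approximate_q_value_iteration(Phi, T_op, w, theta0, n_iters=200):
    """Projected Q-value iteration: theta <- argmin ||Phi theta - T(Phi theta)||_W^2.

    Phi:  (n_pairs, K) feature matrix (one row per state-action pair)
    T_op: callable mapping a Q-vector (length n_pairs) to T(Q) (known model)
    w:    positive weights defining W = diag(w)
    """
    W = np.diag(w)
    # Weighted least squares projection matrix (Phi^T W Phi)^{-1} Phi^T W
    proj = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = proj @ T_op(Phi @ theta)    # theta_{t+1} = proj * T(Phi theta_t)
        # no convergence guarantee in general; see the example that follows
    return theta
```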
The following example aims to show that the choice of the weighted norm k · kW is critical for guaranteeing convergence
of the above scheme.
10.5.2.1 Example
Consider a two-state MDP with no cost and let µ be a stationary policy. The corresponding Markov chain8 is shown in Fig. 10.1: from either state, the chain moves to state 1 with probability ε and to state 2 with probability 1 − ε.
7 The notation ‖·‖_{W,2} is sometimes used for this norm to differentiate it, e.g., from definitions of weighted max-norms as in (10.12). Nevertheless,
we simplify notation by discarding the subscript “2”. We work exclusively with this weighted ℓ2-norm for the rest of this file.
8 A stationary policy and an MDP induce a Markov reward process. A more proper term in our case is a Markov cost process.
We suppress the action u from the notation of all the involved functions in the sequel, since the policy µ is assumed to be
fixed and therefore u = µ(i) when at state i. In particular, the operator of interest becomes
T(Q)(i) = c(i) + α Σ_j Pij Q(j).  (10.22)
Since the underlying MDP model is known for this problem, we can implement the previously described algorithm for
the operator T given by (10.22). First, for each node of the Markov chain we have:
T(Q̂)(1) = c(1) + α Σ_j P1j Q̂(j) = αε Q̂(1) + α(1 − ε) Q̂(2),
T(Q̂)(2) = c(2) + α Σ_j P2j Q̂(j) = αε Q̂(1) + α(1 − ε) Q̂(2),
since c(1) = c(2) = 0. Compactly:
T(Q̂) = (T(Q̂)(1), T(Q̂)(2))^T = A Q̂,  where  A = [αε  α(1−ε); αε  α(1−ε)]  and  Q̂ = (Q̂(1), Q̂(2))^T.  (10.24)
Since there is no cost associated with this problem, we can easily conclude that Q(1) = Q(2) = 0, where Q corresponds
to the (optimal) Q-function and Q(i) = Q(i, µ∗ (i)), i = 1, 2 with µ∗ being an optimal policy. Moreover, at any state
any action is optimal and therefore µ is an optimal policy. Consider the function class:
Q = {Φθ : θ ∈ R},  where Φ = [1, 2]^T, i.e., Qθ(1) = θ and Qθ(2) = 2θ.
If the standard Euclidean norm is used for projection, the projection objective becomes:
min_θ (1/2) ‖Φθ − Q̃t+1‖²₂ = min_θ { (1/2)[θ − θ̂t(αε + 2α(1 − ε))]² + (1/2)[2θ − θ̂t(αε + 2α(1 − ε))]² },  (10.26)
which corresponds to the exact form of (10.21) for this particular problem. Setting the derivative with respect to θ to
zero yields the recursion θ̂t+1 = α̃ θ̂t, with α̃ = (3/5)(αε + 2α(1 − ε)) = 3α(2 − ε)/5.  (10.27)
If α̃ > 1, then the algorithm diverges (unless θ̂0 = 0). Clearly, this can happen for many α and ε (e.g., α = 0.9,
ε = 0.1). Fortunately, using a different weighted norm, we can guarantee convergence for any α, ε.
Let
W = diag(w1, w2)  (10.28)
for some positive constants w1, w2. Using ‖·‖_W based on such a diagonal matrix in the projection objective, we have:
min_θ (1/2) ‖Φθ − Q̃t+1‖²_W = min_θ { (w1/2)[θ − θ̂t(αε + 2α(1 − ε))]² + (w2/2)[2θ − θ̂t(αε + 2α(1 − ε))]² }.  (10.29)
Differentiating with respect to θ and setting the derivative to 0, we obtain the recursion:
θ̂t+1 = α̃ θ̂t,  with  α̃ = (αε + 2α(1 − ε)) (w1 + 2w2)/(w1 + 4w2).
Idea: Choose the weights w1, w2 as the entries of the stationary distribution of the Markov chain in Fig. 10.1.
The stationary distribution of the Markov chain can be obtained by solving the equation π = πP using the constraint
Σ_i πi = 1. π = πP yields in this case (1 − ε)π1 = επ2. Furthermore, π1 + π2 = 1. Therefore, π1 = ε and π2 = 1 − ε.
We now let w1 = π1 and w2 = π2. The update rule can be rewritten as:
θ̂t+1 = α̃ θ̂t,  with  α̃ = α(2 − ε) (ε + 2(1 − ε))/(ε + 4(1 − ε)) = α(2 − ε)²/(4 − 3ε).
Clearly, α̃ < 1 for any α, ε in this case and therefore, the algorithm converges. This suggests that one should perhaps
use W = diag(π) to define the projection objective, where π is the stationary distribution of the Markov chain for the
underlying stationary policy (assuming that such an invariant measure exists).
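The effect of the two weightings can be checked numerically. The sketch below computes the contraction factor α̃ of the iteration θ̂t+1 = α̃ θ̂t under the unweighted and the π-weighted projections for sample values of α and ε, following the formulas derived above.

```python
alpha, eps = 0.9, 0.1
a = alpha * (2 - eps)                    # = alpha*eps + 2*alpha*(1 - eps)

# Unweighted (Euclidean) projection: theta_{t+1} = (3/5) * a * theta_t
alpha_tilde_euclid = 3 * a / 5

# pi-weighted projection with weights (w1, w2) = (eps, 1 - eps):
w1, w2 = eps, 1 - eps
alpha_tilde_pi = a * (w1 + 2 * w2) / (w1 + 4 * w2)   # = alpha*(2-eps)^2/(4-3*eps)

print(alpha_tilde_euclid)   # > 1 here: the unweighted iteration diverges
print(alpha_tilde_pi)       # < 1 for any alpha < 1, eps in (0, 1): convergence
```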
10.5.2.2 Algorithm
θ̂t+1 = arg min_θ (1/2) ‖Φθ − T(Φθ̂t)‖²_π,  t = 0, 1, 2, . . .  (10.32)
for some initial θ̂0. Here, we abuse notation to write9 ‖·‖_π for ‖·‖_diag(π). Also, Q̂ = [Q̂(i)] = [φ^T(i)θ̂].
9 and of course we recall that this is a simplified notation for ‖·‖_{π,2}, which differs from the weighted max-norm in (10.12).
Projection operator: For a vector z, consider the problem min_{Q̄ = Φθ, θ ∈ R^K} (1/2)‖Q̄ − z‖²_π and denote its solution by Q̄ = Π(z). Then, the algorithm given by (10.32) becomes
Q̂t+1 = Π(T(Q̂t)).  (10.33)
We will show that Π(T(·)) is a contraction mapping in ‖·‖_π, assuming that we know the underlying MDP model.
Assume now that the MDP model is unknown. Consider for the moment (10.32) and set
C(θ; Q̃t+1) = (1/2) ‖Φθ − Q̃t+1‖²_π,  where Q̃t+1 = T(Φθ̂t).
Note: In practice, to evaluate the estimated Q-function at all state-action pairs, ut = µ(xt) is combined with some
exploration process. The recursion can then be modified to account for ut, using ut+1 = µ(xt+1) in an online operation.
The ODE method, roughly speaking, states that if the ODE
θ̇ = −f(θ)
converges to θ* as t → ∞, then the corresponding stochastic difference equation also converges to θ* almost surely. In
our case, focusing again on a stationary policy µ, the relevant choice is
f(θ) = E[(φ^T(xt)θ − c(xt) − α φ^T(xt+1)θ) φ(xt)],
where the expectation is taken in steady state (cf. (10.37)). The associated stochastic recursion is (10.38). Here, we
have suppressed ut, ut+1 from the notation, since ut = µ(xt), ut+1 = µ(xt+1). Writing f(θ) more explicitly, we have:
f(θ) = Σ_i πi (φ^T(i)θ − c(i) − α Σ_j Pij φ^T(j)θ) φ(i) = Σ_i πi (φ^T(i)θ − T(Φθ)(i)) φ(i).
Now the problem becomes finding the equilibrium θ* which satisfies f(θ*) = 0. In matrix form, f(θ) = Φ^T D(Φθ − T(Φθ)),
where D = diag(π), so this equation reads:
Φ^T DΦ θ* − Φ^T D T(Φθ*) = 0
or
θ* = (Φ^T DΦ)⁻¹ Φ^T D T(Φθ*).  (10.40)
Does the fixed point equation in (10.40) have a unique solution? Indeed, we are going to show that there is a unique
solution θ∗ to this equation.
Proof. We already know that J = Π(T(J)) has a unique solution because Π(T(·)) is a contraction mapping. Recall
that Π(z) solves
min_{θ: y=Φθ} (1/2) ‖y − z‖²_π  (10.41)
or
min_θ (1/2) ‖Φθ − z‖²_π
or
min_θ (1/2) (Φθ − z)^T D(Φθ − z).  (10.42)
Let S(θ) = (1/2)(Φθ − z)^T D(Φθ − z). By setting the gradient of S(θ) to zero, we have
∇θ S(θ) = 0
or Φ^T D(Φθ − z) = 0
or θ = (Φ^T DΦ)⁻¹ Φ^T D z.  (10.43)
Multiplying both sides of the last equation by Φ and using the facts that z = T(J) and J = Φθ, we obtain:
J = Φ(Φ^T DΦ)⁻¹ Φ^T D T(J) = Π(T(J)).  (10.44)
Because J = Π(T(J)) has a unique solution, i.e., equation (10.44) has a unique solution J* = Φθ*, the fixed point
equation
θ = (Φ^T DΦ)⁻¹ Φ^T D T(Φθ)  (10.45)
has a unique solution θ* (due to the implicit assumption that Φ is full column rank).
Next, we will prove that the ODE θ̇ = −f(θ) converges to θ* as t → ∞ using an appropriate Lyapunov function;
consider L(θ) = (1/2)‖θ − θ*‖²₂. Since f(θ*) = 0,
θ̇ = −f(θ) = −f(θ) + f(θ*) = −Φ^T DΦ(θ − θ*) + Φ^T D(T(Φθ) − T(Φθ*)).  (10.48)
L̇(θ) = (∇θ L(θ))^T dθ/dt = (∇θ L(θ))^T θ̇
      = (∇θ L(θ))^T [−f(θ) + f(θ*)]
      = (θ − θ*)^T [−Φ^T DΦ(θ − θ*) + Φ^T D(T(Φθ) − T(Φθ*))]
      = −(θ − θ*)^T Φ^T DΦ(θ − θ*) + (θ − θ*)^T Φ^T D(T(Φθ) − T(Φθ*))
      = −‖Φ(θ − θ*)‖²_π + (θ − θ*)^T Φ^T D^{1/2} D^{1/2}(T(Φθ) − T(Φθ*)).  (10.50)
By the Cauchy–Schwarz inequality and the contraction property of T in ‖·‖_π for the fixed policy µ (with modulus α, since
Pµ is non-expansive in ‖·‖_π when π is its stationary distribution), the second term is at most α‖Φ(θ − θ*)‖²_π. As a consequence:
L̇(θ) ≤ −(1 − α)‖Φ(θ − θ*)‖²_π ≤ 0,  (10.52)
with L̇(θ) < 0 for any θ ≠ θ* and L̇(θ) = 0 only when θ = θ*. Therefore, we conclude that θ̇ = −f(θ) converges to θ*. Thus, we also
conclude that, under some conditions, the associated stochastic recursion (10.38) converges to θ* almost surely as
t → ∞.
We have just shown that our approximation scheme above can obtain the Q-function for a fixed policy µ at the state-action
pairs (i, µ(i)). Therefore, for an initial policy µ0, our algorithm can proceed as follows to find an optimal policy:
1. Using the previous scheme with exploration, estimate the Q-function for all state-action pairs when the underlying
stationary policy used is µk . Let Q̂ be the obtained estimate.
2. Next, update the policy to the greedy selection µk+1 (x) = arg minu Q̂(x, u).
3. Repeat.
The above scheme corresponds to an actor-critic algorithm, which performs policy iteration with function approximation.
Consider again the function class Q = {Qθ = Φθ : θ ∈ RK } for some feature extraction. Then, the Q-learning
algorithm with function approximation is formalized as:
θ̂t+1 = θ̂t + εt [c(xt, ut) + α min_v φ^T(xt+1, v)θ̂t − φ^T(xt, ut)θ̂t] φ(xt, ut).
This iteration coincides with (10.38) when exploration is implemented, except for the definition of the temporal
difference inside the parentheses which includes the minimization minv φT (xt+1 , v)θ̂t (off-policy characteristic)
instead of φT (xt+1 , ut+1 )θ̂t .
Note: Q-learning is not guaranteed to converge when combined with function approximation.
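A sketch of this recursion (Q-learning with a linear architecture) follows. The feature map `phi(x, u)`, the simulator `step`, and the exploration rule are assumptions of the example, and, as the preceding note warns, convergence is not guaranteed in general.

```python
import numpy as np

def q_learning_linear_fa(step, phi, X, U, K, alpha, n_steps=100_000,
                         eps_explore=0.1, step_size=0.01, seed=0):
    """Q-learning with linear function approximation Q_theta(x, u) = phi(x, u)^T theta.

    phi(x, u) -> feature vector of length K; step(x, u) -> (x_next, cost).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(K)

    def q_value(x, u):
        return phi(x, u) @ theta

    x = int(rng.integers(X))
    for _ in range(n_steps):
        if rng.random() < eps_explore:
            u = int(rng.integers(U))
        else:
            u = int(np.argmin([q_value(x, v) for v in range(U)]))
        x_next, cost = step(x, u)
        # temporal difference with the minimization (off-policy target)
        td = cost + alpha * min(q_value(x_next, v) for v in range(U)) - q_value(x, u)
        theta = theta + step_size * td * phi(x, u)
        x = x_next
    return theta
```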
In this section we discuss Monte Carlo policy evaluation and Temporal Difference learning, which are model-free
policy evaluation methods. Policy evaluation algorithms estimate the value function Jµ or Qµ for some given policy µ.
Many problems can be cast as value prediction problems. Estimating the probability of some future event, the expected
time until some event occurs and Jµ or Qµ underlying some policy µ in an MDP are all value prediction problems. In
the context of actor-critic learning and policy evaluation, these are on-policy algorithms, and the considered policy µ
is assumed to be stationary or approximately stationary. Monte-Carlo methods are based on the simple idea of using
sample means to estimate the average of a random quantity. It turns out that the variance of these estimators can be high
and therefore, the quality of the underlying value function estimates can be poor. Moreover, Monte Carlo methods in
closed-loop estimation usually introduce bias. In contrast, TD learning can address these issues. As a final note, TD
learning was introduced by Richard S. Sutton in 1988 and it is widely considered as one of the most influential ideas in
reinforcement learning.
Monte Carlo methods learn from complete episodes of experience without bootstrapping, i.e., without using current value
estimates to form the update targets. The corresponding idea is described as follows. Suppose that µ is a stationary policy. Assume that we
wish to evaluate the value function in the discounted scenario,
Jµ(i) = E^µ[Σ_{k=0}^∞ α^k c(xk, uk) | x0 = i],
or in an episodic scenario,
Jµ(i) = E^µ[Σ_{k=0}^N c(xk, uk) | x0 = i],
where the horizon N is a random variable. To evaluate the performance, multiple trajectories of the system can be
simulated using policy µ starting from arbitrary initial conditions up to termination or stopped earlier (truncation of
trajectories). Consider a particular trajectory. After visiting state i at time ti , Jµ (i) is estimated as
Ĵµ(i) = Σ_{k=ti}^{Ñ} α^{k−ti} c(xk, uk)  or  Ĵµ(i) = Σ_{k=ti}^{Ñ} c(xk, uk)
for the discounted or the episodic problem, respectively. Here, Ñ is the terminal time instant of the corresponding trajectory. For
the discounted cost problem, Ñ can be chosen large enough to guarantee a small error. Suppose further that we have M
such estimates Jˆµ,1 (i), Jˆµ,2 (i), . . . , Jˆµ,M (i). Then, Jˆµ (i) is finally estimated as
Ĵµ(i) = (1/M) Σ_{r=1}^M Ĵµ,r(i).
• Jˆµ (i) is a single estimate of Jµ (i) if the trajectory starts at i and Jˆµ (i) assumes summation up to Ñ . Clearly,
such an approach requires multiple trajectories to be initialized at every state i ∈ X to obtain M estimates
Jˆµ,1 (i), Jˆµ,2 (i), . . . , Jˆµ,M (i) for every i.
• We can obtain an estimate Jˆµ,r (i) each time state i is visited on the same trajectory. Every time the corresponding
summation will be taken up to Ñ . This approach implies that we can obtain multiple estimates Jˆµ,r (i) per
trajectory and across trajectories. Clearly, value estimates of this form from the same trajectory are heavily
correlated.
• A single estimate Jˆµ,r (i) can be at most obtained from any trajectory by performing a summation up to termination
starting at the first visit of i in the course of a particular trajectory.
Result: Ĵµ(i) → Jµ(i) as M → ∞, ∀i ∈ X, for all the above schemes if each state i is visited sufficiently often. This depends
on the choice of µ and the initial points of the trajectories.
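A minimal sketch of first-visit Monte Carlo policy evaluation for the discounted problem; the simulator `step`, the policy array `mu` and the truncation horizon are assumptions of the example.

```python
import numpy as np

def mc_policy_evaluation(step, mu, X, alpha, n_trajectories=1000, horizon=200, seed=0):
    """First-visit Monte Carlo estimate of J_mu for a discounted cost problem.

    step(x, u) -> (x_next, cost); mu[x] is the stationary policy; the horizon
    truncates the discounted sum so that the truncation error is small.
    """
    rng = np.random.default_rng(seed)
    returns_sum = np.zeros(X)
    returns_cnt = np.zeros(X)
    for _ in range(n_trajectories):
        x = int(rng.integers(X))                # arbitrary initial condition
        states, costs = [], []
        for _ in range(horizon):
            x_next, cost = step(x, mu[x])
            states.append(x)
            costs.append(cost)
            x = x_next
        # accumulate the discounted tail return at the first visit of each state
        G, first_visit_return = 0.0, {}
        for t in range(horizon - 1, -1, -1):    # backward pass over the trajectory
            G = costs[t] + alpha * G
            first_visit_return[states[t]] = G   # overwritten until the earliest visit
        for s, g in first_visit_return.items():
            returns_sum[s] += g
            returns_cnt[s] += 1
    return returns_sum / np.maximum(returns_cnt, 1)
```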
Consider an infinite horizon discounted cost problem and let µ be a stationary policy. Recall the definition of the Tµ operator
defined in earlier lectures:
Tµ(J)(i) = E[c(x0, µ(x0)) + α Σ_j P_{x0 j}(µ(x0)) J(j) | x0 = i] = c̄(i, µ(i)) + α Σ_j Pij(µ(i)) J(j),
or more compactly Tµ(J) = c̄µ + αPµ J. Recall also that Tµ is a contraction mapping with fixed point the value
function Jµ. The fixed policy Value Iteration method computes Jµ via successive approximations by performing the
recursion Jk+1 = Tµ(Jk) for some initial guess J0 of Jµ. By the contraction property of Tµ, Jk → Jµ as k → ∞.
Assume now that the underlying MDP model is unknown. We therefore require some form of learning of Jµ in this
case. Let Jˆ0 be an initial guess of Jµ , e.g., Jˆ0 = J0 . Suppose that we simulate the system using policy µ and at times k
and k + 1 we observe xk , c(xk , µ(xk )) and xk+1 (we also observe c(xk+1 , µ(xk+1 ))). Let Jˆk be an estimate of Jµ at
time k. Then, c(xk , µ(xk )) + αJˆk (xk+1 ) is clearly a noisy estimate of Tµ (Jk )(xk ) in the fixed policy Value Iteration
recursion, i.e., of Jk+1 (xk ). Because the aforementioned estimate is noisy, we can use it to update Jˆk (xk ) only by a
sufficiently small amount which guarantees convergence.
Idea: Stochastic Approximation Recursion similar to Q-learning, SARSA, etc. (this is the TD(0) algorithm):
Ĵk+1(xk) = (1 − εk(xk)) Ĵk(xk) + εk(xk) [c(xk, µ(xk)) + α Ĵk(xk+1)],
with Ĵk+1(i) = Ĵk(i) for i ≠ xk.
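A compact Python sketch of this TD(0) recursion for evaluating a fixed policy µ; the simulator `step` and the stepsize rule are placeholders.

```python
import numpy as np

def td0_policy_evaluation(step, mu, X, alpha, n_steps=100_000, seed=0):
    """TD(0): update only the visited state using the one-step bootstrapped target."""
    rng = np.random.default_rng(seed)
    J = np.zeros(X)
    N = np.zeros(X)
    x = int(rng.integers(X))
    for _ in range(n_steps):
        x_next, cost = step(x, mu[x])
        N[x] += 1
        eps = 1.0 / N[x]                     # eps_k(i) = eps / N_k(i) with eps = 1
        # temporal difference delta_k = c + alpha * J(x_{k+1}) - J(x_k)
        J[x] += eps * (cost + alpha * J[x_next] - J[x])
        x = x_next
    return J
```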
1. Using the previously mentioned stepsize sequence εk(i) = ε/Nk(i) for some ε > 0, where Nk(i) is the number of
visits to state i by time k, and assuming that all states are visited infinitely often, Ĵk → Jµ as k → ∞ with probability 1.
To elaborate a bit more on this point, we note that if TD(0) converges, then it must converge to some J̃ such that10 the
expected update is zero, i.e., E[c(xk, µ(xk)) + α J̃(xk+1) − J̃(xk) | xk = i] = 0 for every state i.
Equivalently, Tµ(J̃) − J̃ = 0 and since Tµ has a unique fixed point, we conclude that J̃ = Jµ. Moreover, with
the assumptions Σ_k εk(i) = ∞, Σ_k εk²(i) < ∞ and by the ODE method, TD(0) will follow closely trajectories
of the linear ODE:
J̇ = c̄µ + (αPµ − I)J.
The eigenvalues of αPµ − I lie in the open left half complex plane11 and hence, the ODE is globally asymptotically
stable. Therefore, by the ODE method for stochastic approximation, Ĵk → Jµ as k → ∞ with probability 1.
2. If εk = ε > 0 and each state is visited infinitely often, then “tracking” is achieved, i.e., Ĵk will be close to Jµ in
the long run. More precisely, for all accuracy and confidence parameters ε̃, δ > 0, the estimate eventually remains within δ of Jµ with probability at least 1 − ε̃, provided the constant stepsize ε is sufficiently small.
ℓ-step temporal differences: Denote by δk = c(xk, µ(xk)) + α Ĵk(xk+1) − Ĵk(xk) the temporal difference at time k appearing in the TD(0) update; see (10.53). Using ℓ consecutive temporal differences instead of a single one, we obtain:
Ĵk+1(xk) = Ĵk(xk) + εk(xk) Σ_{r=0}^{ℓ−1} α^r δ_{k+r}.  (10.54)
Here, Ĵk(·) is the value function estimate used in δ_{k+r} for any r.
TD(λ) Algorithm with 0 ≤ λ ≤ 1: Considering all future temporal differences with a geometric weighting of
decreasing importance, we obtain the following algorithm:
Ĵk+1(xk) = Ĵk(xk) + εk(xk) Σ_{r=0}^∞ (λα)^r δ_{k+r}.  (10.55)
Again, Ĵk(·) is the value function estimate used in δ_{k+r} for any r.
λ is called the trace-decay parameter and it controls the amount of bootstrapping: for λ = 0 we obtain the TD(0)
algorithm and for λ = 1 a Monte Carlo method (or the TD(1) method) for policy evaluation.
Convergence of TD(λ): Similar to TD(0) but often faster than TD(0) and Monte-Carlo methods if λ is judiciously
chosen. This has been experimentally verified, especially when function approximation is used for the value function.
In practice, good values of λ are determined by trial and error. Also, the value of λ can be modified even during the
algorithm without affecting convergence.
10 More generally, this condition should hold at least for all states that are sampled infinitely often.
11 λi(αPµ − I) = αλi(Pµ) − 1 and since α|λi(Pµ)| < 1, it turns out that Re{αλi(Pµ) − 1} < 0.
Appendix (ODE method for synchronous Q-learning): Consider the first equation in the synchronous Q-learning recursion (10.11). Compactly, this recursion can be written in
matrix form as follows:
Q̂k+1 = Q̂k + εk (h(Q̂k) + Mk).  (10.56)
Here, Q̂k = [Q̂k(x, u)] is the X × U matrix of Q-value estimates at time k, h(Q̂k) is the X × U matrix with entries
h(Q̂k)(x, u) = T(Q̂k)(x, u) − Q̂k(x, u), and Mk is a martingale difference (zero conditional mean) noise term capturing
the sampling of the next state. Then, the associated ODE for the last stochastic matrix recursion is:
Q̇ = h(Q) = T(Q) − Q.
Recall that T is a contraction mapping with parameter α. It turns out that this ODE has a unique globally stable fixed
point, the synchronous Q-learning algorithm is stable and the stochastic recursion (10.56) converges to the optimal Q,
which is the unique fixed point of T.