RL Class Notes
E1 277
Spring 2025
Pratyush Kant & Sahil Chaudhary
Administrative Details
• Saturdays, 9:30–11:00 AM, may occasionally be used for extra classes or tutorials.
• References:
– Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto
– Neuro-Dynamic Programming by Dimitri Bertsekas and John Tsitsiklis
– Optimal Control and Dynamic Programming by Dimitri Bertsekas
– Reinforcement Learning and Optimal Control by Dimitri Bertsekas
• Teams code: 17jggiq
• The first half will be taken by Shalabh Bhatnagar and the second half by Gugan Thoppe.
• Grading: 50% sessionals, 50% finals. Shalabh will set a quiz for 5 marks and a midterm
for 20 marks. For the final component, there will be a course project for 20 marks and a final
exam for 30 marks.
• Four TAs: Kaustubh, Naman, Ankur and Prashana.
• Shalabh’s midterm will be on 15th February.
• Shalabh’s quiz towards the end of January.
• Quiz 1 on Saturday, 1st February.
Contents
1 Lecture 1 (Shalabh Bhatnagar)
1.1 Classes of Problems
1.2 Example of Communication Network
1.3 Exploration and Exploitation
1.4 Example of Tic-Tac-Toe
16 Lecture 16
18 Lecture 3: Temporal Difference Learning and Function Approximation
18.1 Markov Decision Processes (MDP)
18.2 Temporal Difference (TD) Learning
18.2.1 Linear Function Approximation
18.2.2 Objective Function and Update Rule
18.3 TD(0) with Linear Function Approximation
1 Lecture 1 (Shalabh Bhatnagar)
Learning theory is categorized into three types: supervised learning, unsupervised learning and
reinforcement learning. In supervised learning, we have a dataset of input-output pairs, and we
try to learn a function that maps inputs to outputs. In unsupervised learning, we have a dataset
of inputs, and we try to learn the underlying structure of the data. In reinforcement learning, we
have an agent that interacts with an environment and tries to learn a policy that maximizes the
cumulative reward.
There is an agent, an environment and states. States describe the key features of the environment.
The agent is a decision-making entity that interacts with the environment. In the beginning, the
environment is in state S0 . The agent takes action A0 , and the environment transitions probabilis-
tically to a new state S1 and gives a reward R1 to the agent. The agent looks at the new state and
reward and takes another action A1 . The environment again jumps probabilistically to a new state
S2 and gives a reward R2 to the agent. This process continues. The goal of the agent is to select a
sequence of actions depending on the states of the environment so as to maximize the “long-term
reward”.1
The probabilistic transition between states is given by
$$p_t(s, a, s') = \mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a).$$
If the transition probabilities are stationary, the subscript t can be dropped. We will also discretize
time, though it can be continuous in some cases as well.
• N = ∞ (infinite horizon):
¹Transition dynamics are usually stationary but can be non-stationary as well (as in traffic).
– Discounted rewards: The long-term reward is given by
$$\lim_{N\to\infty} \mathbb{E}\left[\sum_{t=0}^{N} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right],$$
where γ ∈ (0, 1) is the discount factor. This has connections to economic theory and is
a good model when future rewards are valued less than immediate rewards.
– Long-term average reward: The long-term reward is given by
$$\lim_{N\to\infty} \frac{1}{N}\, \mathbb{E}\left[\sum_{t=1}^{N} R_t \,\middle|\, S_0 = s\right].$$
1.3 Exploration and Exploitation
The trade-off between exploration and exploitation is a fundamental problem in learning theory.
Exploration is the process of trying out new actions to learn more about the environment;
exploitation is the process of using the knowledge gained so far to maximize the reward. The agent
has to balance the two: if it exploits too much, it might miss out on better actions, and if it
explores too much, it might not accumulate enough reward. The agent therefore has to explore
enough to learn about the environment and exploit enough to maximize the reward.
The way around this, used by most methods, is to select the learnt (greedy) action with a high
probability and, with a low probability, a random action that has not been selected so far.
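This high-probability/low-probability scheme is the ε-greedy rule. A minimal sketch, assuming list-based value estimates (the function name and data layout are illustrative, not from the lecture):

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Pick a uniformly random action w.p. epsilon (explore), else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                # explore
    return max(range(len(Q)), key=lambda a: Q[a])      # exploit current estimates

Q = [0.2, 0.8, 0.5]                          # current value estimates for three actions
assert epsilon_greedy(Q, epsilon=0.0) == 1   # epsilon = 0 is purely greedy
```

With ε = 0 the rule never explores; a small positive ε keeps every action selected infinitely often, which the estimation schemes below rely on.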
2 Lecture 2 (Shalabh Bhatnagar, Sutton chapter 1 and 2)
2.1 Sutton Chapter 1
The basic setting is an agent that interacts with an environment and learns through this interaction.
In practice, there can be more than one agent but in this course, we will consider only one agent.
The agent looks at the state of the environment and takes an action. The environment transitions
to a new state (probabilistically) and gives a reward (probabilistically) to the agent. The agent
learns from the reward and the new state and takes another action. This process continues. The
goal of the agent is to learn a policy that maximizes the long-term reward. The state, action and
reward sequence generated is
S0 , A0 , R1 , S1 , A1 , R2 , S2 , . . . , SN −1 , AN −1 , RN , SN .
The goal of the agent is to select a sequence of actions in response to the states of the environment
so as to maximize the “long-term reward”. The long-term reward depends on the short-term
rewards. To average out the randomness in the environment, we work with the expectation of the
rewards. The expected long-term reward is called the value function.
A policy is a decision rule: in a given state, it prescribes an action to be chosen. Policies can be
deterministic or stochastic. For instance, suppose the number of states is 2, s1 , s2 , and the number
of actions is 3, a1 , a2 , a3 .
• Deterministic policy: π(s1 ) = a1 , π(s2 ) = a3 .
• Stochastic policy:
– π(s1 , a1 ) = 0.7, π(s1 , a2 ) = 0.3, π(s1 , a3 ) = 0.
– π(s2 , a1 ) = 0.2, π(s2 , a2 ) = 0.3, π(s2 , a3 ) = 0.5.
• Optimal policy: The policy that maximizes the long-term reward.
The objective is to find a policy π which maximizes the value function.
There are two parts to an RL problem:
1. Prediction: Given a policy π, estimate the value, Vπ , of the policy.
2. Control: Find the optimal policy. The control problem can only be solved after solving the
prediction problem.
We will also assume a Markovian structure, i.e., the future depends only on the present state and
not on the past states:
$$\mathbb{P}(S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0) = \mathbb{P}(S_{t+1} = s' \mid S_t, A_t).$$
(The agent may still remember the entire history of the interaction.) We will now start Chapter 2
of Sutton’s book.
We will consider a model with a single slot machine with K arms. There is a single state, and
the objective is to decide which arms to pull, in what order, and how many times to pull each arm.
Each time an arm is pulled, a reward is generated randomly from that arm’s probability distribution.
We will also assume there is no correlation between the rewards of the arms; hence the order of
pulls becomes irrelevant.
Define
q ∗ (a) := E[Rt | At = a] a ∈ {1, 2, . . . , K}.
The goal is to find a∗ = arg maxa∈{1,2,...,K} q ∗ (a). The agent does not know q ∗ (a) and has to estimate
it; it has no information about the distribution of the rewards.
Define
$$Q_n(a) := \frac{\sum_{i=1}^{n} R_i\, \mathbb{I}(A_{i-1} = a)}{\sum_{i=1}^{n} \mathbb{I}(A_{i-1} = a)} \quad \forall\, a \in \{1, 2, \dots, K\}.$$
This expression is the average reward of arm a after n pulls, serving as an estimate of q ∗ (a) at time
n. The possible strategies are:
• Greedy strategy: Pull the arm a for which Qn (a) is largest. The estimate itself can be
updated incrementally:
$$Q_{n+1}(a) = \frac{\sum_{i=1}^{n+1} R_{i+1}}{n+1} = \frac{1}{n+1}\left(\sum_{i=1}^{n} R_{i+1} + R_{n+2}\right) = \frac{1}{n+1}\left(n\,Q_n(a) + R_{n+2}\right) = Q_n(a) + \frac{1}{n+1}\left(R_{n+2} - Q_n(a)\right).$$
This saves storage and is called the incremental update rule, since we do not need to store all the
rewards: each reward is observed, used to update the estimate of the mean of the arm, and then
discarded.
$$Q_n(a) \xrightarrow{a.s.} q^*(a) \ \text{as } n \to \infty, \quad\text{i.e.,}\quad \mathbb{P}\left(\lim_{n\to\infty} Q_n(a) = q^*(a)\right) = 1.$$
Instead, if we use the update rule
$$Q_{n+1}(a) = Q_n(a) + \alpha\left(R - Q_n(a)\right)$$
for some constant α ∈ (0, 1), where R is the newly observed reward for arm a, we obtain the
exponential recency-weighted average, which gives more weight to recent rewards. Verify that
the sum of all the weights is 1. This class of algorithms is called fading-memory algorithms; they
are typically used in non-stationary environments where the dynamics change over time.
Q0 (a) is the initial estimate of the mean of the arm, typically set to 0 if no information is
available. Otherwise, if some information is available, it can be set accordingly.
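The two update rules can be compared directly. A short sketch (the function names `incremental_mean` and `recency_weighted` are mine, not from the lecture):

```python
def incremental_mean(rewards, Q0=0.0):
    """Sample-average estimate, updated one reward at a time with step size 1/n."""
    Q, n = Q0, 0
    for r in rewards:
        n += 1
        Q += (r - Q) / n            # incremental update rule: no reward history stored
    return Q

def recency_weighted(rewards, alpha=0.1, Q0=0.0):
    """Exponential recency-weighted average with constant step size alpha."""
    Q = Q0
    for r in rewards:
        Q += alpha * (r - Q)        # recent rewards get geometrically larger weight
    return Q

rewards = [1.0, 0.0, 1.0, 1.0]
assert abs(incremental_mean(rewards) - sum(rewards) / len(rewards)) < 1e-12
```

With Q0 = 0 and constant α, the weight on the i-th most recent reward is α(1 − α)^i, so for all-ones rewards the estimate equals 1 − (1 − α)^n, confirming the weights sum to 1 in the limit.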
3 Lecture 3 (Shalabh Bhatnagar, Sutton chapter 2)
In the case of multi-armed bandits, if the rewards are deterministic, then there is no need for
exploration. But in the case of stochastic rewards, exploration is needed.
SLLN: Let X1 , X2 , . . . be a sequence of i.i.d. random variables with E[Xi ] = µ < ∞. Then,
$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{a.s.} \mu \ \text{as } n \to \infty, \quad\text{i.e.,}\quad \mathbb{P}\left(\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1.$$
For convergence, the step-size sequence {αt } must satisfy
$$\sum_{t=1}^{\infty} \alpha_t = \infty \quad\text{and}\quad \sum_{t=1}^{\infty} \alpha_t^2 < \infty.$$
Thus sequences like $\frac{1}{t+1}$, $\frac{1}{(t+1)\log(t+1)}$, and $\frac{\log(t+1)}{t+1}$, for $t \ge 1$, are valid. These algorithms are called
stochastic approximation algorithms. The general recursion is
$$x_{t+1} = x_t + \alpha_t \left(f(x_t) + \Psi_t\right),$$
where {αt } is a positive step-size sequence and f (xt ) + Ψt is a noisy sample of f at xt . One can
provably argue that, under suitable conditions, starting from an arbitrary x0 the iterates satisfy
xt → x∗ as t → ∞, where f (x∗ ) = 0. What Robbins and Monro showed was that the sequence of
xt ’s converges to the root of f in the mean-square sense, i.e.,
$$\mathbb{E}\,\|x_t - x^*\|^2 \xrightarrow{t\to\infty} 0.$$
Subsequently, it was shown that the sequence of xt ’s converges to the root of f almost surely, i.e.,
$$\mathbb{P}\left(\lim_{t\to\infty} x_t = x^*\right) = 1.$$
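A tiny simulation of the stochastic approximation scheme (the names and the choice f(x) = µ − x are illustrative): with step sizes α_t = 1/(t+1), the iterates track the root of f even though every evaluation is corrupted by zero-mean noise.

```python
import random

def robbins_monro(noisy_f, x0=0.0, steps=20000):
    """x_{t+1} = x_t + a_t (f(x_t) + noise), with a_t = 1/(t+1)."""
    x = x0
    for t in range(steps):
        x += (1.0 / (t + 1)) * noisy_f(x)   # sum a_t = inf, sum a_t^2 < inf
    return x

random.seed(0)
mu = 3.0   # f(x) = mu - x has root x* = mu; we only see noisy evaluations of f
x_star = robbins_monro(lambda x: (mu - x) + random.gauss(0.0, 1.0))
assert abs(x_star - mu) < 0.1
```

With this particular f, the iteration reduces exactly to the running sample mean of µ + noise, which is why the incremental bandit update is a special case of stochastic approximation.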
Here the noise is $\Psi_t = R_{t+1} - \mathbb{E}[R_{t+1} \mid A_t = a]$ and the function is $f(Q_t(a)) = \mathbb{E}[R_{t+1} \mid A_t = a] - Q_t(a)$;
hence the algorithm converges to $Q^*(a) = \mathbb{E}[R_{t+1} \mid A_t = a]$, which is the $q^*(a)$ defined previously.
The upper-confidence-bound (UCB) rule selects
$$A_t = \arg\max_a \left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right],$$
where Nt (a) is the number of times arm a has been pulled up to time t and c > 0 controls the
degree of exploration. Initially, at t = 0, Nt (a) = 0 for all a, and we select one action arbitrarily,
say a. Then Nt (a) = 1, and the uncertainty term is no longer ∞ for a, unlike for the other arms.
As t increases, Nt (a) increases as well, but log t grows at a much slower rate and is practically
constant. Hence, the exploration term decreases as t increases and eventually dies out, leaving the
algorithm using the exploitation term only. In summary, the algorithm explores initially and, as t
increases, it exploits.
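A sketch of the selection rule just described (the function and variable names are mine; c is the exploration constant):

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried arms go first."""
    for a, n in enumerate(N):
        if n == 0:
            return a               # uncertainty term is "infinite" for unpulled arms
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

# A rarely pulled arm can win despite a lower estimate ...
assert ucb_action([1.0, 0.9], [100, 1], t=101) == 1
# ... but once pull counts even out, the exploitation term dominates.
assert ucb_action([1.0, 0.9], [100, 100], t=201) == 0
```

The two assertions mirror the discussion above: the bonus rewards under-explored arms early on, and fades once every N_t(a) is large.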
In order to update the preferences, we use the following gradient update rule:
$$H_{t+1}(a) = H_t(a) + \alpha_t \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)},$$
where $\mathbb{E}[R_t] = \sum_x \pi_t(x)\, q_*(x)$ and $q_*(x) = \mathbb{E}[R_t \mid A_t = x]$.
$$\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \frac{\partial}{\partial H_t(a)}\left(\sum_x \pi_t(x)\, q_*(x)\right) = \sum_x q_*(x)\, \frac{\partial \pi_t(x)}{\partial H_t(a)} = \sum_x \left(q_*(x) - \beta_t\right) \frac{\partial \pi_t(x)}{\partial H_t(a)}.$$
A good choice of βt reduces the variance of the updates, improving the convergence of the
algorithm while keeping the expression unbiased, since βt is independent of x and thus
$$\sum_x \beta_t \frac{\partial \pi_t(x)}{\partial H_t(a)} = \beta_t \sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)} = \beta_t \frac{\partial}{\partial H_t(a)}\left(\sum_x \pi_t(x)\right) = \beta_t \frac{\partial}{\partial H_t(a)}(1) = 0.$$
Recall $\pi_t(x) = \frac{e^{H_t(x)}}{\sum_y e^{H_t(y)}}$. Then,
$$\frac{\partial \pi_t(x)}{\partial H_t(a)} = \frac{\partial}{\partial H_t(a)}\left(\frac{e^{H_t(x)}}{\sum_y e^{H_t(y)}}\right) = \pi_t(x)\left(\mathbb{1}_{x=a} - \pi_t(a)\right),$$
so that
$$\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_x \left(q_*(x) - \beta_t\right) \pi_t(x)\left(\mathbb{1}_{x=a} - \pi_t(a)\right).$$
We select βt = R̄t , where R̄t is the average reward up to time t. Thus, the update rule becomes
$$H_{t+1}(a) = H_t(a) + \alpha_t \sum_x \left(q_*(x) - \bar R_t\right) \pi_t(x)\left(\mathbb{1}_{x=a} - \pi_t(a)\right).$$
Observe that
$$\sum_x \left(q_*(x) - \bar R_t\right)\pi_t(x)\left(\mathbb{1}_{x=a} - \pi_t(a)\right) = \mathbb{E}\left[\left(q_*(A_t) - \bar R_t\right)\left(\mathbb{1}_{A_t=a} - \pi_t(a)\right)\right],$$
so in practice the update can use the sampled reward Rt in place of q∗ (At ). Running the algorithm
for a long time, Ht (a) → H ∗ (a) as t → ∞, where H ∗ (a) is the preference for arm a at convergence.
The algorithm is called softmax action selection.
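The full preference-update loop can be sketched as follows, with q∗(A_t) replaced by the sampled reward R_t as the expectation identity above justifies. All names, the step count, and the Gaussian reward noise are illustrative assumptions:

```python
import math
import random

def softmax(H):
    m = max(H)                                   # subtract max for numerical stability
    e = [math.exp(h - m) for h in H]
    s = sum(e)
    return [x / s for x in e]

def gradient_bandit(q_star, alpha=0.1, steps=10000, seed=0):
    """H_{t+1}(a) = H_t(a) + alpha (R_t - Rbar_t)(1{a = A_t} - pi_t(a))."""
    rng = random.Random(seed)
    K = len(q_star)
    H = [0.0] * K
    r_bar, t = 0.0, 0
    for _ in range(steps):
        pi = softmax(H)
        A = rng.choices(range(K), weights=pi)[0]
        R = q_star[A] + rng.gauss(0.0, 1.0)      # noisy sample of q*(A)
        t += 1
        r_bar += (R - r_bar) / t                 # baseline: average reward so far
        for a in range(K):
            H[a] += alpha * (R - r_bar) * ((1.0 if a == A else 0.0) - pi[a])
    return softmax(H)

pi = gradient_bandit([0.0, 1.0, 0.0])
assert pi.index(max(pi)) == 1    # most preference mass lands on the best arm
```

The running-average baseline r_bar plays the role of βt = R̄t above: it changes none of the expected updates but shrinks their variance.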
4 Finite Horizons Problems (Bertsekas Volume 1, Chapter 1, Sha-
labh Bhatnagar)
4.1 Markov Decision Processes (MDPs)
MDPs assume that we know the system model. However, finding the optimal policy is still a difficult
problem. The key assumption is a controlled Markov chain. We have a state space denoted by S
and an action space denoted by A. Given a state s, A(s) denotes the set of feasible actions in state
s. Further,
$$\mathcal{A} = \bigcup_{s \in S} A(s).$$
Let {Xn } be a sequence of random variables defined on a common probability space, (Ω, F, P).
It depends on a control valued sequence {Zn } such that Zn ∈ A(Xn ). The sequence {Xn } is a
controlled Markov chain if for all n ≥ 0 and all s ∈ S,
$$\mathbb{P}(X_{n+1} = s \mid X_n, Z_n, X_{n-1}, Z_{n-1}, \dots, X_0, Z_0) = \mathbb{P}(X_{n+1} = s \mid X_n, Z_n).$$
This is very similar to the definition of a Markov chain. The transition probability, denoted by
p(s, a, s′ ) := P(Xn+1 = s′ | Xn = s, Zn = a), satisfies the following properties:
1. p(s, a, s′ ) ≥ 0.
2. $\sum_{s' \in S} p(s, a, s') = 1$.
A Markov Decision Process (MDP) is a controlled Markov chain with a cost structure, a cost
associated with every transition, denoted by g(in , an , in+1 ). The cost is a function of the current
state, the action taken and the next state. Here, Xn = in , Xn+1 = in+1 and Zn = an . Clearly,
in , in+1 ∈ S and an ∈ A(in ).
For now, we will assume that the state space, action space and feasible actions are known along
with the transition probabilities.
A policy is a collection of functions
$$\pi = \{\mu_0, \mu_1, \dots, \mu_{N-1}\}.$$
At time N the process terminates (called the terminal instant). For each k ∈ [N − 1] ∪ {0},
µk : S → A is a function such that µk (s) ∈ A(s) for all s ∈ S.
µ0 is used to take actions at time 0, µ1 at time 1, and so on. The collection of these functions is
called a policy. The objective is to find an optimal policy, π ∗ , that minimizes the cost over the
horizon N .
For each x0 ∈ S and each policy π define Jπ (x0 ) as
$$J_\pi(x_0) = \mathbb{E}\left[g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), x_{k+1}) \,\middle|\, X_0 = x_0\right].$$
The expectation is taken over the joint distribution of the random variables X0 , X1 , . . . , XN under
the policy π.
The objective is to find a policy π ∗ such that Jπ∗ (x0 ) ≤ Jπ (x0 ) for all x0 ∈ S and all policies π.
Here gk (xk , µk (xk ), xk+1 ) is the cost incurred at time k when the state is xk , the action is
ak = µk (xk ), and the next state is xk+1 , for any instant k ∈ [N − 1]. gN (xN ) is the terminal cost
when the terminal state is xN .¹ In a finite-horizon problem, it is very difficult to find a policy
independent of time; the optimal policy is time-dependent most of the time.
Let Π denote the set of all policies. The optimal policy is denoted by π ∗ ∈ Π such that Jπ∗ (x0 ) ≤
Jπ (x0 ) for all x0 ∈ S and all π ∈ Π. The optimal cost is denoted by J ∗ (x0 ) = Jπ∗ (x0 ) =
minπ∈Π Jπ (x0 ). Observe that π ∗ is independent of x0 . One can also have multiple optimal policies.
The principle of optimality states the following: suppose π ∗ = {µ∗0 , µ∗1 , . . . , µ∗N −1 } is an optimal
policy for the original problem, and consider the subproblem that starts at time k in state xk and
minimizes the cost-to-go from k to N . Then the truncated policy {µ∗k , µ∗k+1 , . . . , µ∗N −1 } is optimal
for this subproblem. The principle of optimality is a necessary condition for optimality: if it were
not true, the optimal policy for the subproblem would differ from the tail of the optimal policy
for the original problem, and substituting the better tail would improve the original policy — a
contradiction to its optimality.
Hence, the optimal policy is independent of the history of the system.
Proposition: For every initial state x0 ∈ S, the optimal cost J ∗ (x0 ) equals J0 (x0 ), the value
obtained at the last step of the following backward recursion:
$$J_N(x_N) = g_N(x_N) \quad \forall\, x_N \in S,$$
$$J_k(x_k) = \min_{a_k \in A(x_k)} \mathbb{E}_{X_{k+1}}\left[g_k(x_k, a_k, x_{k+1}) + J_{k+1}(x_{k+1})\right] \quad \forall\, k \in [N-1] \cup \{0\},\ x_k \in S.$$
The second equation means that the optimal cost at time k and state xk is the minimum of the
cost of taking action ak in state xk and the expected cost of following the optimal policy from time
k + 1 and state xk+1 .
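The backward recursion can be sketched generically; the representation of p and g and all names below are my own choices, not the lecture's:

```python
def backward_induction(S, A, p, g, gN, N):
    """Finite-horizon DP: J_N = g_N, then J_k(x) = min_a E[g_k + J_{k+1}] backwards.

    p[s][a] is a list of (next_state, prob) pairs; g(k, s, a, s2) is the stage cost;
    gN(s) is the terminal cost. Returns J_0 and the time-dependent policy {mu_k}.
    """
    J = {s: gN(s) for s in S}                      # initialise with J_N
    policy = []
    for k in range(N - 1, -1, -1):                 # sweep backwards in time
        Jk, mu_k = {}, {}
        for s in S:
            vals = {a: sum(pr * (g(k, s, a, s2) + J[s2]) for s2, pr in p[s][a])
                    for a in A(s)}
            mu_k[s] = min(vals, key=vals.get)
            Jk[s] = vals[mu_k[s]]
        J = Jk
        policy.insert(0, mu_k)
    return J, policy

# Toy check: moving costs 1 per stage, staying costs 2, zero terminal cost.
S = [0, 1]
A = lambda s: ["stay", "move"]
p = {0: {"stay": [(0, 1.0)], "move": [(1, 1.0)]},
     1: {"stay": [(1, 1.0)], "move": [(0, 1.0)]}}
g = lambda k, s, a, s2: 1.0 if a == "move" else 2.0
J0, policy = backward_induction(S, A, p, g, lambda s: 0.0, N=3)
assert J0[0] == 3.0 and policy[0][0] == "move"
```

Note that the returned policy is a list of per-stage mappings µ_k, matching the remark that finite-horizon optimal policies are time-dependent in general.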
Jk∗ (xk ) is the optimal cost for the (N − k)-stage subproblem. Let JN∗ (xN ) = gN (xN ) = JN (xN ) for
all xN ∈ S. We will show by induction that Jk∗ (xk ) = Jk (xk ) for all k ∈ [N − 1] ∪ {0} and all
xk ∈ S. Assume that for some k and all xk+1 , we have J∗k+1 (xk+1 ) = Jk+1 (xk+1 ). Write
π k = {µk , π k+1 }. Then, ∀ xk ∈ S,
$$J_k^*(x_k) = \min_{\mu_k, \pi^{k+1}} \mathbb{E}_{(X_{k+1},\dots,X_N)}\left[\sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), x_{i+1}) + g_N(x_N) \,\middle|\, X_k = x_k\right].$$
By definition, the inner (minimized) expectation is J∗k+1 (xk+1 ), which equals Jk+1 (xk+1 ) by the
induction hypothesis. Hence, the above expression becomes
$$J_k^*(x_k) = \min_{\mu_k} \mathbb{E}_{X_{k+1}}\left[g_k(x_k, \mu_k(x_k), x_{k+1}) + J_{k+1}(x_{k+1}) \mid X_k = x_k\right].$$
Recall that µk : S → A with µk (xk ) ∈ A(xk ), so minimizing over µk at xk is the same as minimizing
over ak ∈ A(xk ):
$$J_k^*(x_k) = \min_{a_k \in A(x_k)} \mathbb{E}_{X_{k+1}}\left[g_k(x_k, a_k, x_{k+1}) + J_{k+1}(x_{k+1}) \mid X_k = x_k\right].$$
This is the definition of Jk (xk ). Hence, Jk∗ (xk ) = Jk (xk ) for all k ∈ [N − 1] ∪ {0} and all xk ∈ S.
5 Finite Horizon Problems (Shalabh Bhatnagar)
5.1 Dynamic Programming Example: Chess Match
Consider an example of a chess match between a player and an opponent. The goal is to formulate
an optimal policy from the viewpoint of the player. A player can select:
• Timid play: The player plays defensively and never wins. The draw has probability pd and
the loss has probability 1 − pd .
• Bold play: The player plays aggressively and never draws. The win has probability pw and
the loss has probability 1 − pw .
Once the player chooses a style for a game, they stick to it for that game. Further, pd > pw . The score assignment is as
follows:
• Win: 1.
• Draw: 0.5.
• Loss: 0.
We define the state as s = (points of the player) − (points of the opponent). (This is by design
a maximization problem.) We further assume that the intermediate rewards rk (xk , a, xk+1 ) are 0
for all k ∈ [N − 1]; only the terminal reward rN (xN ) = JN (xN ) shows up. If the match is tied at
the end, it goes into sudden-death mode, and whoever wins the next game wins the match.
The optimal reward-to-go at the k-th stage is denoted by Jk (xk ):
$$J_k(x_k) = \max\left\{p_d J_{k+1}(x_k) + (1-p_d) J_{k+1}(x_k - 1),\ p_w J_{k+1}(x_k + 1) + (1-p_w) J_{k+1}(x_k - 1)\right\}.$$
In particular,
$$J_{N-1}(x_{N-1}) = \max\left\{p_d J_N(x_{N-1}) + (1-p_d) J_N(x_{N-1} - 1),\ p_w J_N(x_{N-1} + 1) + (1-p_w) J_N(x_{N-1} - 1)\right\}.$$
In this situation, the optimal strategy is to play bold. If the player is ahead by one point, it is
optimal to play timid.
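The recursion can be checked numerically. A sketch (the values of pd and pw are illustrative; the terminal condition encodes the sudden-death tiebreak, won with probability pw by bold play):

```python
from functools import lru_cache

def chess_value(N, pd, pw):
    """J_k(x): optimal probability of winning the match, x = current score difference."""
    @lru_cache(maxsize=None)
    def J(k, x):
        if k == N:                      # terminal: a lead wins; a tie goes to sudden death
            return 1.0 if x > 0 else (pw if x == 0 else 0.0)
        timid = pd * J(k + 1, x) + (1 - pd) * J(k + 1, x - 1)
        bold = pw * J(k + 1, x + 1) + (1 - pw) * J(k + 1, x - 1)
        return max(timid, bold)
    return J

J = chess_value(N=2, pd=0.9, pw=0.45)
# Trailing before the last game: only bold play can still win the match.
assert abs(J(1, -1) - 0.45 * 0.45) < 1e-12
# Leading by one before the last game: timid play (draw out the match) is better.
assert abs(J(1, 1) - (0.9 * 1.0 + 0.1 * 0.45)) < 1e-12
```

The two assertions reproduce the qualitative conclusion above: bold play when level or behind, timid play when one point ahead.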
The analogous equations hold for slow service with qs replacing qf . The Dynamic Programming
algorithm is as follows:
6 Stochastic Shortest Path Problem (Shalabh Bhatnagar)
So far, we have seen
• Basics of RL. (Ch 1 of Sutton and Barto)
• The multi-armed bandit problem. (Ch 2 of Sutton and Barto)
• Finite Horizon MDPs. (Ch 1 of Bertsekas Volume 1)
Today we will cover the Stochastic Shortest Path Problem. (Refer to Neuro-Dynamic Programming
by Dimitri Bertsekas and John Tsitsiklis, or to Ch 2 of Bertsekas Volume 1, which is much more
detailed.) Stochastic shortest-path problems are characterized by a goal state or terminal state.
The terminal state is reached with probability 1, but the time at which it is reached is not known
in advance. Stochastic shortest path problems are also referred to as episodic problems.
We consider stationary policies, i.e., π = {µ, µ, . . .}. It is convenient to call µ the policy instead
of π. A stationary policy can be shown to be optimal in Stochastic Shortest Path Problems
and even Infinite Horizon Problems. We give some definitions:
1. Proper Policy: A stationary policy µ is proper if
$$P_\mu = \max_{i=1,2,\dots,n} \mathbb{P}(s_n \ne 0 \mid s_0 = i, \mu) < 1,$$
where n is the number of non-terminal states. It says that from every initial state, the
probability of reaching the terminal state within n steps is positive.
2. Improper Policy: A stationary policy µ is improper if
$$P_\mu = \max_{i=1,2,\dots,n} \mathbb{P}(s_n \ne 0 \mid s_0 = i, \mu) = 1.$$
It says that from some initial state, the probability of reaching the terminal state within n
steps is 0. Basically, a policy which is not proper is improper.
As a remark, µ is proper ⇐⇒ in the Markov chain corresponding to µ, there is a path of positive
probability from every state to the terminal state. Recall the general definition of an MDP:
Denote
P(Xn+1 = j | Xn = i, Zn = µ(i)) = Pµ (i, j)
where Pµ is the transition probability matrix satisfying:
1. Pµ (i, j) ≥ 0.
2. $\sum_{j} P_\mu(i, j) = 1$.
It is a homogeneous Markov Chain. If the policy is non-stationary, then the transition probabilities
depend on the time instant. The transition probabilities are denoted by Pµk (i, j) where k is the
time instant, and the Markov chain is a non-homogeneous Markov Chain.
We ask ourselves: what is the probability that after 2n steps, the system is not in the terminal
state?
$$\mathbb{P}(S_{2n} \ne 0 \mid S_0 = i, \mu) = \mathbb{P}(S_{2n} \ne 0 \mid S_n \ne 0, S_0 = i, \mu)\, \mathbb{P}(S_n \ne 0 \mid S_0 = i, \mu) + \mathbb{P}(S_{2n} \ne 0 \mid S_n = 0, S_0 = i, \mu)\, \mathbb{P}(S_n = 0 \mid S_0 = i, \mu).$$
The second term is zero, since the terminal state is absorbing; hence $\mathbb{P}(S_{2n} \ne 0 \mid S_0 = i, \mu) \le P_\mu^2$.
If µ is a proper policy and g is bounded, i.e., |g(i, a, j)| ≤ M for all i, j and a ∈ A(i), then
$$J_\mu(i) = \mathbb{E}_\mu\left[\sum_{k=0}^{\infty} g(S_k, \mu(S_k), S_{k+1}) \,\middle|\, S_0 = i\right]$$
is well defined:
$$|J_\mu(i)| \le \sum_{m=0}^{\infty} \mathbb{E}_\mu\left[\,|g(S_m, \mu(S_m), S_{m+1})| \,\middle|\, S_0 = i\right] = \sum_{m=0}^{\infty} \sum_j \sum_k p^m_{ij}(\mu(i))\, p_{jk}(\mu(j))\, |g(j, \mu(j), k)|.$$
Let $\hat g_\mu(j) := \sum_k p_{jk}(\mu(j))\, |g(j, \mu(j), k)|$. Then,
$$|J_\mu(i)| \le \sum_{m=0}^{\infty} \sum_j p^m_{ij}(\mu)\, \hat g_\mu(j).$$
Now $\sum_{j=1}^{n} p^m_{ij}(\mu) = \mathbb{P}(S_m \ne 0 \mid S_0 = i, \mu) \le P_\mu(i)^{\lfloor m/n \rfloor}$ and $\max_{j=1,2,\dots,n} \hat g_\mu(j) \le k$ for some constant k. Thus,
$$|J_\mu(i)| \le \sum_{m=0}^{\infty} P_\mu(i)^{\lfloor m/n \rfloor}\, k < \infty \quad \forall\, i \in [n], \ \text{since } P_\mu(i) < 1.$$
Define $\bar g(i, u) = \sum_{j=0}^{n} p_{ij}(u)\, g(i, u, j)$, the expected single-stage cost in non-terminal state i ∈ [n]
when action u is chosen. We now define mappings T and Tµ acting on J = (J(1), . . . , J(n)),
where J is a mapping from the non-terminal (NT) states to R:
$$(TJ)(i) := \min_{u \in A(i)}\left[\bar g(i, u) + \sum_{j=1}^{n} p_{ij}(u)\, J(j)\right] \quad \forall\, i \in [n],$$
$$(T_\mu J)(i) := \bar g_\mu(i) + \sum_{j=1}^{n} p_{ij}(\mu(i))\, J(j) \quad \forall\, i \in [n], \qquad \bar g_\mu(i) = \bar g(i, \mu(i)).$$
T and Tµ are operators on the space of mappings from NT states to R: they act on J to give
another such mapping. Define the matrix
$$P_\mu = \begin{pmatrix} p_{11}(\mu(1)) & p_{12}(\mu(1)) & \cdots & p_{1n}(\mu(1)) \\ p_{21}(\mu(2)) & p_{22}(\mu(2)) & \cdots & p_{2n}(\mu(2)) \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1}(\mu(n)) & p_{n2}(\mu(n)) & \cdots & p_{nn}(\mu(n)) \end{pmatrix}.$$
Pµ is not a stochastic matrix in general: each row sums to at most 1, since the matrix is only over
the non-terminal states.
7 Lecture 7: Stochastic Shortest Path Problems
Episodic or Stochastic Shortest Path Problems are characterized by a goal state or terminal state,
0. The terminal state is reached with probability 1, but the time at which it is reached is not known
in advance. Stochastic shortest path problems are also referred to as episodic problems. The
non-terminal states are referred to as 1, 2, . . . , n. The terminal state is absorbing and cost-free:
p00 (u) = 1 ∀ u ∈ A(0) and g(0, u, 0) = 0 ∀ u ∈ A(0).
We say that a policy µ is proper if
$$P_\mu = \max_{i=1,2,\dots,n} \mathbb{P}(s_n \ne 0 \mid s_0 = i, \mu) < 1.$$
It says that from every initial state, the probability of reaching the terminal state within n steps
is positive. Let S = {1, . . . , n} be the set of non-terminal states and S + = S ∪ {0}.
Let’s get back to the analysis of the Stochastic Shortest Path Problem.
$$\bar g(i, u) = \sum_{j=0}^{n} p_{ij}(u)\, g(i, u, j)$$
is the expected single-stage cost in state i ∈ [n] when action u is chosen. Define mappings
T, Tµ : R|S| → R|S| , where R|S| = {f | f : S → R}, as follows. Let J = (J(1), . . . , J(n)) be a
mapping from the non-terminal states to R. Then,
$$(TJ)(i) := \min_{u \in A(i)}\left[\bar g(i, u) + \sum_{j=1}^{n} p_{ij}(u)\, J(j)\right] \quad \forall\, i \in S,$$
$$(T_\mu J)(i) := \bar g_\mu(i) + \sum_{j=1}^{n} p_{ij}(\mu(i))\, J(j) \quad \forall\, i \in S, \qquad \bar g_\mu(i) = \bar g(i, \mu(i)).$$
Using this notation, we can write
$$T_\mu J = \bar g_\mu + P_\mu J,$$
where ḡµ = (ḡµ (1), . . . , ḡµ (n)). Further define T k J = T (T k−1 J) for k ≥ 1, with T 0 := I, so that
$$T^k J = (T \circ T \circ \cdots \circ T)\, J,$$
the k-fold composition of T with itself applied to J.
Consider k = 2. Then,
$$(T^2 J)(i) = (T(TJ))(i) = \min_{u \in A(i)}\left[\bar g(i, u) + \sum_{j=1}^{n} p_{ij}(u)\,(TJ)(j)\right] = \min_{u \in A(i)}\left[\bar g(i, u) + \sum_{j=1}^{n} p_{ij}(u) \min_{v \in A(j)}\left(\bar g(j, v) + \sum_{k=1}^{n} p_{jk}(v)\, J(k)\right)\right].$$
The above expression can be interpreted in the context of finite horizon problems as the optimal
cost of a two stage problem with single stage costs ḡ(·, ·) and terminal cost J(·). Then, for any k,
(T k J)(i) is the optimal cost of a k-stage problem with initial state i, single stage costs ḡ(·, ·) and
terminal cost J(·).
$$(T^k J)(i) = \min_{u \in A(i)}\left[\bar g(i, u) + \sum_{j=1}^{n} p_{ij}(u)\,(T^{k-1} J)(j)\right] \quad \forall\, i \in S = \{1, 2, \dots, n\}.$$
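The operator and its iterates are easy to realize on a small example. The two-state instance below is mine, not from the lecture; note that the rows of p sum to less than 1 because the remaining probability mass goes to the terminal state, exactly as with the substochastic matrix Pµ:

```python
def T(J, p, gbar, A):
    """One application of the SSP Bellman operator on non-terminal states 0..n-1.

    p[i][u][j] = p_ij(u) restricted to non-terminal j; gbar[i][u] = expected stage cost.
    """
    n = len(J)
    return [min(gbar[i][u] + sum(p[i][u][j] * J[j] for j in range(n)) for u in A[i])
            for i in range(n)]

# In state i, action 0 jumps straight to the terminal state; action 1 keeps wandering.
p    = [[[0.0, 0.0], [0.5, 0.5]],
        [[0.0, 0.0], [0.0, 0.9]]]
gbar = [[2.0, 1.0], [3.0, 1.0]]
A    = [[0, 1], [0, 1]]

J = [0.0, 0.0]
for _ in range(50):          # iterating T: T^k J converges to the optimal cost
    J = T(J, p, gbar, A)
assert J == [2.0, 3.0]       # exiting immediately is optimal from both states
```

Each pass is a one-stage problem with terminal cost J, so after k passes J holds the optimal k-stage costs, matching the interpretation of T^k above.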
Lemma 1 (Monotonicity): For any J, J̄ ∈ R|S| , if J(i) ≤ J̄(i) for all i ∈ S, then
(T J)(i) ≤ (T J̄)(i) and (Tµ J)(i) ≤ (Tµ J̄)(i) for all i ∈ S.
Proof. Consider k = 1:
$$(T(J + re))(i) = \min_{u \in A(i)}\left[\bar g(i, u) + \sum_{j=1}^{n} p_{ij}(u)\,(J(j) + r)\right] = \min_{u \in A(i)}\left[\bar g(i, u) + \sum_{j=1}^{n} p_{ij}(u)\, J(j) + r \sum_{j=1}^{n} p_{ij}(u)\right] \le \min_{u \in A(i)}\left[\bar g(i, u) + \sum_{j=1}^{n} p_{ij}(u)\, J(j)\right] + r = (TJ)(i) + r,$$
using $\sum_{j=1}^{n} p_{ij}(u) \le 1$.
This is the Bellman equation for the policy µ. It also gives a way to compute Jµ by iterating Tµ
starting from an arbitrary vector J; this numerical method is called value iteration.
We have seen that
$$\mathbb{P}(s_k \ne 0 \mid s_0 = i, \mu) \le P_\mu(i)^{\lfloor k/n \rfloor} \quad \forall\, i \in S.$$
Also,
$$(P_\mu^k J)(i) = \sum_{j=1}^{n} \mathbb{P}(s_k = j \mid s_0 = i, \mu)\, J(j) \le \mathbb{P}(s_k \ne 0 \mid s_0 = i, \mu) \max_{j=1,2,\dots,n} J(j) \xrightarrow{k\to\infty} 0.$$
Hence,
$$\lim_{k\to\infty} T_\mu^k J = \lim_{k\to\infty} \left[P_\mu^k J + \sum_{m=0}^{k-1} P_\mu^m \bar g_\mu\right] = J_\mu.$$
8 Lecture 8: Stochastic Shortest Path (Shalabh Bhatnagar)
Recall the proposition:
(a) For a proper policy µ, the associated cost vector Jµ satisfies
$$\lim_{k\to\infty} (T_\mu^k J)(i) = J_\mu(i) \quad \forall\, J \in \mathbb{R}^n,\ i \in S,$$
and Jµ is the unique solution of Jµ = Tµ Jµ .
(b) If ∃ J ∈ Rn with J(i) ≥ (Tµ J)(i) for all i ∈ S, then µ is proper.
Proof. For a stationary policy µ, suppose ∃ J ∈ Rn such that J(i) ≥ (Tµ J)(i) for all i ∈ S. By
monotonicity of Tµ , we have (Tµ J)(i) ≥ (Tµ2 J)(i). Applying this recursively, we get
$$J(i) \ge (T_\mu J)(i) \ge (T_\mu^2 J)(i) \ge \cdots \ge (T_\mu^k J)(i) = (P_\mu^k J)(i) + \left(\sum_{m=0}^{k-1} P_\mu^m \bar g_\mu\right)(i).$$
If µ were not proper, then by assumption Jµ (i) = ∞ for some i ∈ S, a contradiction to the above
inequality, since $\lim_{k\to\infty} \sum_{m=0}^{k-1} P_\mu^m \bar g_\mu = J_\mu$ while J(i) is finite. Hence, µ must be proper.
Proof. ((a), (b)) We will first show that T has at most one fixed point. Suppose J and J ′ are two
fixed points of T . Let µ and µ′ be policies attaining the respective minima, so that
$$J = TJ = T_\mu J, \qquad J' = TJ' = T_{\mu'} J'.$$
(Recall that $(TJ)(i) = \min_{u \in A(i)} \sum_{j \in S} p_{ij}(u)\left(g(i, u, j) + J(j)\right)$ for all i ∈ S, so such µ and µ′ exist.)
From proposition 1 (b), µ and µ′ are proper. By proposition 1 (a), J = Jµ and J ′ = Jµ′ . Now,
$$J = TJ = T^2 J = \cdots = T^k J$$
for any k ≥ 1. Further, T k J ≤ Tµk′ J, since Tµ′ evaluates a fixed policy while T minimizes. It then
follows that J ≤ limk→∞ Tµk′ J = Jµ′ = J ′ . Similarly, J ′ ≤ J. Hence, J = J ′ , and T has at most
one fixed point.
We will now show that T has at least one fixed point. Let µ be a proper policy (there exists a
proper policy by assumption A). Let µ′ be another policy such that Tµ′ Jµ = T Jµ . Then,
$$J_\mu = T_\mu J_\mu \ge T J_\mu = T_{\mu'} J_\mu \implies J_\mu \ge T_{\mu'} J_\mu \implies \mu' \text{ is proper, by proposition 1 (b).}$$
Continuing in this manner, we obtain a sequence of policies {µk } such that each µk is proper and
$$J_{\mu_k} = T_{\mu_k} J_{\mu_k} \ge T J_{\mu_k} = T_{\mu_{k+1}} J_{\mu_k} \ge T_{\mu_{k+1}}^2 J_{\mu_k} \ge \cdots \ge \lim_{n\to\infty} T_{\mu_{k+1}}^n J_{\mu_k} = J_{\mu_{k+1}}.$$
Hence,
$$J_{\mu_k} \ge T J_{\mu_k} \ge J_{\mu_{k+1}} \quad \forall\, k,$$
where $T_{\mu_{k+1}} J_{\mu_k} = T J_{\mu_k}$. However, one cannot keep improving the cost Jµk indefinitely, as
the number of policies is finite. Hence, there exists a policy µ such that Jµ ≥ T Jµ ≥ Jµ , i.e.,
Jµ = T Jµ . By proposition 1 (a), Jµ is then the unique fixed point of T .
Next we will show that Jµ = J ∗ and T k J → J ∗ as k → ∞. Let e = (1, 1, . . . , 1) and let δ > 0 be a
scalar. Let Ĵ be an n-dimensional vector satisfying Tµ Ĵ = Ĵ − δe, i.e.,
$$\hat J = T_\mu \hat J + \delta e = (\bar g_\mu + \delta e) + P_\mu \hat J.$$
Thus Ĵ is the cost vector corresponding to the policy µ with ḡµ replaced by ḡµ + δe; hence such a
Ĵ exists. Moreover, Jµ ≤ Ĵ. This implies that
$$J_\mu = T J_\mu \le T \hat J \le T_\mu \hat J = \hat J - \delta e \le \hat J,$$
and, iterating,
$$J_\mu = T^k J_\mu \le T^k \hat J \le T^{k-1} \hat J \le \hat J.$$
Thus {T k Ĵ} is a bounded monotone sequence, and T k Ĵ → J̃ as k → ∞, where
$$T \tilde J = T\left(\lim_{k\to\infty} T^k \hat J\right) = \lim_{k\to\infty} T^{k+1} \hat J = \tilde J \implies \tilde J = J_\mu.$$
Similarly,
$$J_\mu - \delta e = T J_\mu - \delta e \le T(J_\mu - \delta e) \le T J_\mu = J_\mu.$$
Further,
$$T(J_\mu - \delta e) \le T^2(J_\mu - \delta e) \le \cdots \le J_\mu.$$
Hence {T k (Jµ − δe)} is a monotonically increasing sequence bounded above, with
limk→∞ T k (Jµ − δe) = Jµ . Now let J be any vector with Jµ − δe ≤ J ≤ Ĵ. (Recall that Ĵ is the
cost vector for policy µ with single-stage costs ḡµ + δe.) Again, from monotonicity of T ,
$$T^k(J_\mu - \delta e) \le T^k J \le T^k \hat J \quad \forall\, k \ge 1.$$
Hence,
$$J_\mu = \lim_{k\to\infty} T^k(J_\mu - \delta e) \le \lim_{k\to\infty} T^k J \le \lim_{k\to\infty} T^k \hat J = J_\mu.$$
Proof. ((c)) If µ is optimal, then Jµ = J ∗ . By assumptions (A) and (B), µ is proper. By proposition
1 (a),
Tµ J ∗ = Tµ Jµ = Jµ = J ∗ = T J ∗ .
9 Lecture 9: Stochastic Shortest Path (Shalabh Bhatnagar)
Recall that we were looking at the operator T : R|S| → R|S| defined as
$$(TJ)(i) = \min_{u \in A(i)} \sum_{j \in S} p_{ij}(u)\left(g(i, u, j) + J(j)\right) \quad \forall\, i \in S.$$
In today’s lecture, we will show that T and Tµ are contraction maps in a certain weighted norm
‖·‖ψ , i.e., ∃ β ∈ (0, 1) such that for all J, J̄ ∈ R|S| ,
$$\|TJ - T\bar J\|_\psi \le \beta\, \|J - \bar J\|_\psi \quad\text{and}\quad \|T_\mu J - T_\mu \bar J\|_\psi \le \beta\, \|J - \bar J\|_\psi.$$
Recall that S = {1, 2, . . . , n} is the set of non-terminal states and 0 is the terminal state. Let
S + = S ∪ {0} be the set of all states.
Contraction Mapping (Banach Fixed Point) Theorem: Let S be a complete separable metric
space with metric ρ. Suppose T is a contraction with respect to ρ. Then there exists a unique fixed
point x∗ of T , i.e., T x∗ = x∗ .¹
We will show that there is a vector φ = (φ(1), . . . , φ(n)) with φ(i) > 0 for all i, and a scalar
β ∈ [0, 1), such that for all J, J̄ ∈ R|S| ,
$$\|TJ - T\bar J\|_\psi \le \beta\, \|J - \bar J\|_\psi, \quad\text{where}\quad \|J\|_\psi = \max_{i \in S} \frac{|J(i)|}{\varphi(i)}.$$
Proposition: Assume all stationary policies are proper. Then ∃ a vector φ = (φ(1), . . . , φ(n)) with
φ(i) > 0 for all i, such that the mappings T and Tµ , for all stationary policies µ, are contractions
with respect to the norm ‖·‖ψ . In particular, ∃ β ∈ (0, 1) such that
$$\sum_{j=1}^{n} p_{ij}(u)\, \varphi(j) \le \beta\, \varphi(i) \quad \forall\, i \in S,\ u \in A(i).$$
Proof. Consider a new stochastic shortest path problem in which the transition probabilities are
the same as before, but all transition costs are equal to −1, except at the terminal state, where
g(0, u, 0) = 0 ∀ u ∈ A(0).
¹A complete metric space is a metric space in which every Cauchy sequence converges to a point in the space. A
separable metric space is a metric space that has a countable dense subset.
Denote by Ĵ(i) the optimal cost-to-go from state i in the new problem. Then,
$$\hat J(i) = -1 + \min_{u \in A(i)} \sum_{j \in S} p_{ij}(u)\, \hat J(j) \le -1 + \sum_{j \in S} p_{ij}(u)\, \hat J(j) \quad \text{for any given } u \in A(i).$$
Let φ(i) = −Ĵ(i). Then φ(i) ≥ 1 for all i, and
$$-\hat J(i) \ge 1 + \sum_{j \in S} p_{ij}(u)\left(-\hat J(j)\right), \quad\text{i.e.,}\quad \varphi(i) \ge 1 + \sum_{j \in S} p_{ij}(u)\, \varphi(j).$$
Hence,
$$\sum_{j \in S} p_{ij}(u)\, \varphi(j) \le \varphi(i) - 1 \le \beta\, \varphi(i), \qquad \beta := \max_{i \in S} \frac{\varphi(i) - 1}{\varphi(i)} < 1.$$
We now turn our attention to numerical schemes for solving the MDPs.
9.1 Numerical Schemes for MDPs
9.1.1 Value Iteration
Recall from proposition 1 that, for all J ∈ R|S| ,
$$\lim_{k\to\infty} T_\mu^k J = J_\mu, \qquad \lim_{k\to\infty} T^k J = J^*.$$
The value iteration scheme Vn+1 = T Vn therefore satisfies Vn → V ∗ as n → ∞, where V ∗ is the
optimal cost-to-go function satisfying V ∗ = T V ∗ .
Look at Sutton and Barto, Chapter 4, the Grid World example. In the Grid World example, the
state space is S = {1, 2, . . . , 16} and the action space is A(i) = {N, E, S, W } for all i ∈ S. The
two corners on the main diagonal are terminal states, while the remaining 14 are non-terminal.
The feasible actions are such that the agent cannot move out of the grid: if a direction is infeasible,
the agent stays in the same state (cell). The rewards are all −1 until termination; after termination,
the reward is 0. Therefore, the agent must reach the goal state as soon as it can.
Consider the equiprobable random policy, in which from any of the 14 non-terminal states the
agent moves in each of the four directions with equal probability 1/4. Given this policy, the aim is
to apply value iteration for the equiprobable random policy (we are not looking for the optimal
policy).
• Initialize V0 (i) = 0 for all i ∈ S. This is the initial cost-to-go function.
• For the first iteration (k = 1),
$$V(1) \leftarrow \tfrac{1}{4}(-1 - 1 - 1 - 1) + \tfrac{1}{4}(0 + 0 + 0 + 0) = -1.$$
Likewise, check that V (i) = −1 for all non-terminal i ∈ S.
• For the second iteration (k = 2),
$$V(1) \leftarrow \tfrac{1}{4}(-1 - 1 - 1 - 1) + \tfrac{1}{4}(-1 - 1 - 1 + 0) = -\tfrac{7}{4}.$$
Likewise, check that V (i) = −7/4 for all i ∈ S adjacent to the terminal states, while the rest
are −2.
• For the third iteration (k = 3),
$$V(1) \leftarrow \tfrac{1}{4}(-1 - 1 - 1 - 1) + \tfrac{1}{4}\left(-\tfrac{7}{4} - 2 - 2 + 0\right) = -\tfrac{39}{16}.$$
31
39
Likewise, check that V (i) = − 16 for all i ∈ S that is adjacent to the terminal states and the
47
rest are − 16 but the states surrounded by the −2 becomes −3 and so-on.
• These values will converge to the optimal cost-to-go function V ∗ (but after attaining values
like −20, really slow convergence).
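The sweeps above can be reproduced exactly. In the sketch below, the state numbering and all names are mine (states 0..15 in row-major order, with corners 0 and 15 terminal):

```python
def gridworld_policy_eval(iters, size=4):
    """Synchronous evaluation of the equiprobable random policy on the Grid World."""
    n = size * size
    terminal = {0, n - 1}
    V = [0.0] * n

    def step(s, d):
        r, c = divmod(s, size)
        dr, dc = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}[d]
        if 0 <= r + dr < size and 0 <= c + dc < size:
            return (r + dr) * size + (c + dc)
        return s                                  # infeasible move: stay in place

    for _ in range(iters):                        # reward is -1 on every transition
        V = [0.0 if s in terminal else
             sum(0.25 * (-1.0 + V[step(s, d)]) for d in "NSEW")
             for s in range(n)]
    return V

assert gridworld_policy_eval(1)[1] == -1.0        # first sweep: all -1
assert gridworld_policy_eval(2)[1] == -1.75       # -7/4 next to a terminal corner
assert gridworld_policy_eval(3)[1] == -2.4375     # -39/16, as in the third iteration
```

After a few thousand sweeps, the values next to the terminal corners settle near −14 and the interior near −18 to −22, the cost-to-go of the random policy.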
10 Lecture 10: Policy Iteration (Shalabh Bhatnagar)
We will now discuss the Gauss–Seidel value iteration. For i = 2, 3, . . . , n,
$$(FJ)(i) = \min_{u \in A(i)}\left[\sum_{j=1}^{n} p_{ij}(u)\, g(i, u, j) + \sum_{j=1}^{i-1} p_{ij}(u)\,(FJ)(j) + \sum_{j=i}^{n} p_{ij}(u)\, J(j)\right],$$
i.e., the already-updated components (FJ)(j), j < i, are used as soon as they become available.
As a remark,
• (F J)(1) = (T J)(1).
• limk→∞ F k J = J ∗ for all J ∈ R|S| .
There is another procedure called Policy Iteration. In value iteration, we start with some J ∈ R^{|S|}
and repeatedly apply the operator T. In policy iteration, we start with some policy µ and update
the policy at each iteration. The procedure is as follows:
1. Start with an initial policy µ0.
2. Policy evaluation: Given a policy µk, compute J^{µk}(i), i ∈ S, as the solution to the linear system

   J(i) = Σ_{j=1}^n p_ij(µk(i)) (g(i, µk(i), j) + J(j))  ∀ i ∈ S.

3. Policy improvement: With J^{µk} obtained from the policy evaluation, find µ_{k+1} that satisfies

   T_{µ_{k+1}} J^{µk} = T J^{µk}.

4. Keep iterating the policy evaluation and policy improvement steps. Since the number of
policies is finite, the policy iteration algorithm will converge to the optimal policy in a finite
number of steps. Alternatively, one can stop the iteration when a tolerance criterion is met.
The above structure is like a nested loop: the outer loop is the policy improvement and the inner loop
is the policy evaluation. For the inner loop, starting from an initial given J0(·), update

J_{ℓ+1}(i) = Σ_{j=1}^n p_ij(µk(i)) (g(i, µk(i), j) + J_ℓ(j))  ∀ i ∈ S,  so that J_ℓ → J^{µk}.

Repeat the outer loop if J^{µ_{k+1}}(i) < J^{µk}(i) for some i ∈ S. If J^{µ_{k+1}}(i) = J^{µk}(i) for all i ∈ S, then stop
the iteration and output the policy µ_{k+1} as the optimal policy.
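The two alternating steps can be sketched as follows. The 2-state stochastic shortest-path instance below (transition probabilities, termination probabilities, and costs) is made up purely for illustration; policy evaluation is done by solving the linear system directly.

```python
# A sketch of policy iteration on a tiny stochastic shortest-path
# problem (the 2-state example below is made up for illustration).
import numpy as np

# Non-terminal states 1, 2; state 0 is terminal (cost-free, absorbing).
# P[u][i][j] = p_ij(u) among non-terminal states; P0[u][i] = p_i0(u).
# g[u][i] = expected one-stage cost of action u in state i.
P  = {"a": np.array([[0.0, 0.5], [0.3, 0.0]]),
      "b": np.array([[0.2, 0.0], [0.0, 0.1]])}
P0 = {"a": np.array([0.5, 0.7]), "b": np.array([0.8, 0.9])}
g  = {"a": np.array([1.0, 1.0]), "b": np.array([2.0, 3.0])}

def evaluate(mu):
    """Policy evaluation: solve J = g_mu + P_mu J (cost after termination is 0)."""
    Pmu = np.array([P[mu[i]][i] for i in range(2)])
    gmu = np.array([g[mu[i]][i] for i in range(2)])
    return np.linalg.solve(np.eye(2) - Pmu, gmu)

def improve(J):
    """Policy improvement: greedy action in each state."""
    return [min("ab", key=lambda u: g[u][i] + P[u][i] @ J) for i in range(2)]

mu = ["b", "b"]
while True:
    J = evaluate(mu)
    mu_new = improve(J)
    if mu_new == mu:
        break
    mu = mu_new
print(mu, J)
```

The loop terminates in finitely many iterations with a fixed policy, matching the proposition below: each improvement step produces a policy with componentwise smaller cost.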
Proposition The policy iteration algorithm generates an improving sequence of proper policies,
i.e.,
J µk+1 (i) ≤ J µk (i) ∀ i ∈ S, k ∈ N
It terminates with an optimal policy µ∗ in a finite number of steps.
Proof. Given a proper policy µ, the new policy µ̄ is obtained via policy improvement as
Tµ̄ J^µ = T J^µ.
Then,
J^µ = Tµ J^µ ≥ T J^µ = Tµ̄ J^µ.
In particular, J^µ ≥ Tµ̄ J^µ. By monotonicity of Tµ̄, J^µ ≥ Tµ̄^k J^µ for all k ≥ 1, and (once µ̄ is known
to be proper) letting k → ∞ gives J^µ ≥ J^µ̄.
We know that µ is proper. How do we show that µ̄ is proper? Suppose, for the sake of contradiction,
that µ̄ is improper. Then ∃ i ∈ S such that J^µ̄(i) = ∞ (by assumption (B)). For that same i, J^µ(i) = ∞
(as J^µ ≥ J^µ̄). This is a contradiction, as µ is proper. Hence, µ̄ is proper.
Suppose µ is not optimal. Then either ∃ i ∈ S such that J^µ̄(i) < J^µ(i), or J^µ = J^µ̄. In the latter case,
J^µ = J^µ̄ = Tµ̄ J^µ̄ = Tµ̄ J^µ = T J^µ  (since Tµ̄ J^µ = T J^µ),
so J^µ = T J^µ, which implies J^µ = J*, i.e., µ is optimal, a contradiction. Hence the new policy is
strictly better than the current policy whenever the current policy is not optimal. Since the number
of proper policies is finite, the policy iteration algorithm terminates in a finite number of steps,
giving an optimal proper policy.
10.3 Multi-Stage Lookahead Policy Iteration
Regular policy iteration uses a one-step lookahead: it finds the optimal decision for a one-stage
problem with stage cost g(i, u, j) and terminal cost J^µ(j) when the current policy is µ.
In the m-stage lookahead version, we find the optimal policy for an m-stage dynamic programming
problem in which we start in state i ∈ S, make m successive decisions incurring the corresponding
m stage costs, and receive a terminal cost J^µ(j), where j is the state reached after m stages.
Claim: The m-stage policy iteration terminates with the optimal policy under the same conditions
as regular policy iteration.
The m-stage DP recursion defines policies µ̄_{m−1}, . . . , µ̄_0 backwards in time:
• For k = m − 1: T_{µ̄_{m−1}} J^µ = T J^µ.
• For k = m − 2: T_{µ̄_{m−2}} T J^µ = T² J^µ.
• For k = m − 3: T_{µ̄_{m−3}} T² J^µ = T³ J^µ.
And so on, until
• For k = 0: T_{µ̄_0} T^{m−1} J^µ = T^m J^µ.
Since J^µ = Tµ J^µ ≥ T J^µ, monotonicity gives

T^{k+1} J^µ ≤ T^k J^µ ≤ J^µ  ∀ k.    (∗)

Hence,

T_{µ̄_k} T^{m−k−1} J^µ = T^{m−k} J^µ ≤ T^{m−k−1} J^µ  ∀ k = 0, 1, . . . , m − 1.

Thus, for the successor policy µ̄ = µ̄_0 generated by the m-stage policy iteration, applying T_{µ̄} repeatedly
and using monotonicity gives T_{µ̄}^ℓ T^{m−1} J^µ ≤ T^m J^µ for all ℓ ≥ 1, and letting ℓ → ∞,

J^µ̄ ≤ T^m J^µ ≤ J^µ  (set k = 0 in (∗)).
11 Lecture 11: Infinite Horizon Discounted Cost (Shalabh Bhatnagar)
Scribe: Sahil
In this setting there is no termination state. Let the states be denoted by {1, 2, · · · , n},
A(i) = set of feasible actions in state i,
A = ∪_{i∈S} A(i) = set of all actions. Further, we assume |S| < ∞, |A| < ∞.
Let us define

J*(i) = min_µ E[ Σ_{k=0}^∞ α^k g(i_k, µ(i_k), i_{k+1}) | i_0 = i ],

where 0 < α < 1 is called the discount factor. For a stationary policy µ, the associated cost satisfies

J^µ = Tµ J^µ = gµ + αPµ J^µ.    (1)
Let e = (1, 1, · · · , 1)^T. Then, for any vector J = (J(1), J(2), · · · , J(n))^T and r ∈ R,

(T(J + re))(i) = min_{u∈A(i)} Σ_{j=1}^n P_ij(u) (g(i, u, j) + α(J + re)(j))
= min_{u∈A(i)} Σ_{j=1}^n P_ij(u) (g(i, u, j) + αJ(j)) + αr
= (T J)(i) + αr,

i.e., T(J + re) = T J + αre.
Lemma 11.2. For every k ≥ 1, vector J, stationary policy µ and scalar r,

(T^k(J + re))(i) = (T^k J)(i) + α^k r,  i = 1, · · · , n,    (4)
(Tµ^k(J + re))(i) = (Tµ^k J)(i) + α^k r,  i = 1, · · · , n.    (5)

The proof follows by induction; complete it as an exercise.
We can convert a DCP into an SSPP by adding a termination state 0: from every state i and action u,
the process moves to j with probability α p_ij(u) and terminates (moves to state 0, which is absorbing
and cost-free) with probability 1 − α. Then:
Probability of termination in the first stage = 1 − α.
Probability of termination in the second stage = α(1 − α).
⋮
Probability of termination in the k-th stage = α^{k−1}(1 − α).
The probability of not having terminated by the k-th stage is

P_k = 1 − (1 − α)(1 + α + · · · + α^{k−1}) = 1 − (1 − α) (1 − α^k)/(1 − α) = α^k.
The expected single-stage cost in the k-th stage of the SSPP is therefore α^k Σ_{j=1}^n P_ij(µ(i)) g(i, µ(i), j).
Note: All policies are proper for the associated SSPP, since from every state and under every policy
there is a probability 1 − α of termination at each stage.
Note also that for the DCP, the expected single-stage cost at the k-th stage is the same quantity,
α^k Σ_{j=1}^n P_ij(µ(i)) g(i, µ(i), j). Under policy µ,

SSPP: Jµ(i) = E[ Σ_{k=0}^∞ g(i_k, µ(i_k), i_{k+1}) | i_0 = i ],    (6)
DCP: Jµ(i) = E[ Σ_{k=0}^∞ α^k g(i_k, µ(i_k), i_{k+1}) | i_0 = i ],    (7)

and the two costs coincide.
Proposition 11.3. For any bounded J : S → R, the optimal cost function satisfies J*(i) =
lim_{N→∞} (T^N J)(i) for all i ∈ S.
Proof. Let M = max_{i,u,j} |g(i, u, j)|. Consider a policy π = {µ0, µ1, · · · } with µk : S → A such that
µk(i) ∈ A(i) for all i ∈ S, k ≥ 0. Then,

Jπ(i) = lim_{N→∞} E[ Σ_{k=0}^{N−1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ]
= lim_{N→∞} E[ Σ_{k=0}^{K−1} α^k g(i_k, µ_k(i_k), i_{k+1}) + Σ_{k=K}^{N−1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ]
= E[ Σ_{k=0}^{K−1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ] + lim_{N→∞} E[ Σ_{k=K}^{N−1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ].

Thus,

E[ Σ_{k=0}^{K−1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ] = Jπ(i) − lim_{N→∞} E[ Σ_{k=K}^{N−1} α^k g(i_k, µ_k(i_k), i_{k+1}) | i_0 = i ].

Since the tail is bounded in magnitude by α^K M/(1 − α), adding the terminal term α^K J(i_K) gives

Jπ(i) − α^K M/(1−α) − α^K max_{j∈S} |J(j)| ≤ E[ Σ_{k=0}^{K−1} α^k g(i_k, µ_k(i_k), i_{k+1}) + α^K J(i_K) | i_0 = i ]
≤ Jπ(i) + α^K M/(1−α) + α^K max_{j∈S} |J(j)|.

Taking the minimum over π on all sides, we have for all i ∈ S and K > 0,

J*(i) − α^K M/(1−α) − α^K max_{j∈S} |J(j)| ≤ (T^K J)(i) ≤ J*(i) + α^K M/(1−α) + α^K max_{j∈S} |J(j)|.    (8)

Letting K → ∞ on all sides,

lim_{K→∞} [ J*(i) − α^K M/(1−α) − α^K max_{j∈S} |J(j)| ] ≤ lim_{K→∞} (T^K J)(i) ≤ lim_{K→∞} [ J*(i) + α^K M/(1−α) + α^K max_{j∈S} |J(j)| ],    (9)

and the correction terms vanish, so

lim_{K→∞} (T^K J)(i) = J*(i).    (10)
We next show that J* satisfies the Bellman equation J* = T J*. Moreover, J* is the unique solution
of this equation within the class of bounded functions.
Proof. Apply T to all sides of (8). Using monotonicity of T and Lemma 11.2 (with r = M/(1−α) + max_{j∈S} |J(j)|),

(T J*)(i) − α^{K+1} ( M/(1−α) + max_{j∈S} |J(j)| ) ≤ (T^{K+1} J)(i) ≤ (T J*)(i) + α^{K+1} ( M/(1−α) + max_{j∈S} |J(j)| ).

Letting K → ∞ and using lim_{K→∞} (T^{K+1} J)(i) = J*(i) from Proposition 11.3, we get J* = T J*.
For uniqueness, suppose Ĵ is bounded and satisfies Ĵ = T Ĵ. Then Ĵ = T^k Ĵ for all k, and by Proposition 11.3,

lim_{k→∞} T^k Ĵ = J*,

so Ĵ = J*.
For a fixed stationary policy µ, consider an alternative MDP where A(i) = {µ(i)} ∀ i ∈ S; applying
the same argument in this MDP yields the corresponding results for Tµ and Jµ.
12 Lecture 12: Infinite Horizon Discounted Cost (Shalabh Bhatnagar)
Scribe: Sahil
Corollary 12.0.1 (Bellman equation for a given policy). For every stationary policy µ, the associated
cost function satisfies

Jµ(i) = Σ_{j∈S} P_ij(µ(i)) (g(i, µ(i), j) + αJµ(j))  ∀ i ∈ S.

Moreover, Jµ is the unique solution to this equation within the class of bounded functions.
Proposition 12.1 (Necessary and sufficient condition). A stationary policy µ is optimal if and only
if µ(i) attains the minimum in the Bellman equation for every i ∈ S, i.e.,

Tµ J* = T J*,    (12)

which, since J* = T J*, is equivalent to J* = Tµ J* = T J*.
Corollary 12.2.1 (Rate of Convergence of Value Iteration). For any bounded function J : S → R,
all i ∈ S and k ≥ 1,

(T^k J)(i) + C_k ≤ J*(i) ≤ (T^k J)(i) + C̄_k,

where C_k and C̄_k are defined at the end of this lecture.
Example: A machine can be in one of n states 1, 2, · · · , n where 1 is the best state and n is the
worst state.
• Suppose the transition probabilities Pij are known
• Cost of operating machine for one period is g(i) when state of machine is i.
Now define the action in each state as follows:

action = { "O": operate the machine;  "C": replace it by a new machine }.

The cost incurred when C is chosen is R. Once replaced, the new machine is guaranteed to stay in state 1
for one period. Suppose α ∈ (0, 1) is a given discount factor. The Bellman equation for the given
system is

J*(i) = min{ R + g(1) + αJ*(1),  g(i) + α Σ_{j=1}^n P_ij J*(j) }.

Then,

Optimal policy = { C, if R + g(1) + αJ*(1) < g(i) + α Σ_{j=1}^n P_ij J*(j);  O, otherwise }.

Note: Assume
1. g(1) ≤ g(2) ≤ · · · ≤ g(n),
2. P_ij = 0 if j < i (the machine never improves on its own),
3. P_ij ≤ P_{(i+1)j} if i < j (from a worse state, worse states are more likely).
Then,

Σ_j P_ij J*(j) ≤ Σ_j P_{(i+1)j} J*(j),
g(i) + α Σ_j P_ij J*(j) ≤ g(k) + α Σ_j P_kj J*(j),  i < k.

Let S_R = { i ∈ S : R + g(1) + αJ*(1) ≤ g(i) + α Σ_{j=1}^n P_ij J*(j) }, and let

i* = smallest state in S_R, if S_R is non-empty;  i* = n + 1, if S_R is empty.

The optimal policy is thus a threshold policy: operate in states i < i* and replace in states i ≥ i*.
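The threshold structure can be checked numerically. The sketch below runs value iteration on a small instance; the particular numbers n, g, R, α, and P are made up for illustration (chosen to respect the assumptions above).

```python
# Value iteration for the machine-replacement example (the numbers
# below -- n, g, R, alpha, P -- are made up for illustration).
import numpy as np

n, R, alpha = 4, 2.0, 0.9
g = np.array([0.0, 1.0, 3.0, 6.0])           # operating costs, nondecreasing
P = np.array([[0.7, 0.3, 0.0, 0.0],          # P[i, j] = 0 for j < i:
              [0.0, 0.6, 0.4, 0.0],          # the machine only degrades
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 1.0]])

J = np.zeros(n)
for _ in range(2000):
    replace = R + g[0] + alpha * J[0]        # cost of action C (same in every state)
    operate = g + alpha * P @ J              # cost of action O in each state
    J = np.minimum(replace, operate)

policy = np.where(R + g[0] + alpha * J[0] < g + alpha * P @ J, "C", "O")
print(policy)   # threshold structure: "O" up to some i*, then all "C"
```

With these numbers the computed policy replaces in every state except the best one, i.e., i* = 2.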
Recall that

(Tµ J)(i) = Σ_{j=1}^n P_ij(µ(i)) (g(i, µ(i), j) + αJ(j)),  i ∈ S
= Σ_{j=1}^n P_ij(µ(i)) g(i, µ(i), j) + α Σ_{j=1}^n P_ij(µ(i)) J(j),  i ∈ S
= ĝ(i, µ(i)) + α Σ_{j=1}^n P_ij(µ(i)) J(j),  i ∈ S.

Let

ĝµ = ( ĝ(1, µ(1)), · · · , ĝ(n, µ(n)) )^T,  Pµ = [ P_ij(µ(i)) ]_{i,j=1}^n,

i.e., the i-th row of Pµ is P_{i·}(µ(i)). Then

Jµ = ĝµ + αPµ Jµ,
(I − αPµ) Jµ = ĝµ,
Jµ = (I − αPµ)^{−1} ĝµ,

where the inverse exists since every eigenvalue of αPµ has modulus at most α < 1.
Please note that this closed form is only valid for a fixed stationary policy µ.
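The closed form above can be cross-checked against fixed-point iteration with Tµ. This is a small sketch; the 3-state chain, costs, and α are made up for illustration.

```python
# A sketch checking J_mu = (I - alpha P_mu)^{-1} g_hat_mu against
# repeated application of T_mu (the 3-state chain here is made up).
import numpy as np

alpha = 0.8
P_mu = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])          # P_ij(mu(i))
g_mu = np.array([1.0, 0.5, 2.0])            # expected one-stage costs g_hat(i, mu(i))

# Closed form: J_mu = (I - alpha P_mu)^{-1} g_hat_mu.
J_direct = np.linalg.solve(np.eye(3) - alpha * P_mu, g_mu)

# Fixed-point iteration: J <- T_mu J = g_hat_mu + alpha P_mu J.
J = np.zeros(3)
for _ in range(500):
    J = g_mu + alpha * P_mu @ J

print(np.allclose(J, J_direct))   # True: both compute the policy's cost
```

The direct solve costs O(n^3) once; the iteration converges geometrically at rate α, which is why value-iteration-style schemes are preferred when n is large.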
Also, recall that

Jµ(i) = E[ Σ_{k=0}^∞ α^k g(i_k, µ(i_k), i_{k+1}) | i_0 = i ]
= ĝ(i, µ(i)) + Σ_{k=1}^∞ α^k E[ g(i_k, µ(i_k), i_{k+1}) | i_0 = i ].

Suppose the expected one-stage costs are bounded as

β ≤ ĝ(i, µ(i)) ≤ β̄  ∀ i.
In vector notation,

ĝµ + (αβ/(1−α)) e ≤ Jµ ≤ ĝµ + (αβ̄/(1−α)) e,

and hence

(β/(1−α)) e ≤ ĝµ + (αβ/(1−α)) e ≤ Jµ ≤ ĝµ + (αβ̄/(1−α)) e ≤ (β̄/(1−α)) e.

Given a vector J, we know that Tµ J = ĝµ + αPµ J. Subtracting this from Jµ = ĝµ + αPµ Jµ,
we get

Jµ − Tµ J = αPµ (Jµ − J),
Jµ − J = (Tµ J − J) + αPµ (Jµ − J).

Thus, if the cost-per-stage vector is Tµ J − J, then Jµ − J is the corresponding cost-to-go vector.
Letting γ = min_i (Tµ J − J)(i) and γ̄ = max_i (Tµ J − J)(i), the previous bounds give

(γ/(1−α)) e ≤ Tµ J − J + (αγ/(1−α)) e ≤ Jµ − J ≤ Tµ J − J + (αγ̄/(1−α)) e ≤ (γ̄/(1−α)) e.

Applying these bounds with J replaced by T^{k−1} J (and µ a policy attaining T on T^{k−1} J) yields the
value iteration error bounds of Corollary 12.2.1,

(T^k J)(i) + C_k ≤ J*(i) ≤ (T^k J)(i) + C̄_k,

where

C_k = (α/(1−α)) min_i [ (T^k J)(i) − (T^{k−1} J)(i) ],
C̄_k = (α/(1−α)) max_i [ (T^k J)(i) − (T^{k−1} J)(i) ].
13 Lecture 13: Online Lecture (Shalabh Bhatnagar)
Scribe: Sahil
Topics to cover:
1. Policy Iteration for Discounted Cost
2. Monte Carlo Technique
3. Temporal Difference Learning Algorithms (full state case)
Proposition 13.1. Let µ and µ̄ be two stationary policies such that

Tµ̄ Jµ = T Jµ,

or equivalently,

g(i, µ̄(i)) + α Σ_{j=1}^n P_ij(µ̄(i)) Jµ(j) = min_{u∈A(i)} [ g(i, u) + α Σ_{j=1}^n P_ij(u) Jµ(j) ]  ∀ i ∈ [n].

Then Jµ̄ ≤ Jµ. Moreover, if Jµ̄ = Jµ, then both µ and µ̄ are optimal.
Proof. For all i,

Jµ(i) = g(i, µ(i)) + α Σ_{j=1}^n P_ij(µ(i)) Jµ(j)
≥ min_{u∈A(i)} [ g(i, u) + α Σ_{j=1}^n P_ij(u) Jµ(j) ]
= (T Jµ)(i) = (Tµ̄ Jµ)(i).

Thus,

Jµ = Tµ Jµ ≥ T Jµ = Tµ̄ Jµ,

and in particular Jµ ≥ Tµ̄ Jµ. Applying Tµ̄ repeatedly and using monotonicity, Jµ ≥ Tµ̄^k Jµ → Jµ̄,
so Jµ ≥ Jµ̄.
If Jµ̄ = Jµ, then

Jµ = Jµ̄ = Tµ̄ Jµ̄ = Tµ̄ Jµ = T Jµ = T Jµ̄  (since Tµ̄ Jµ = T Jµ),
∴ Jµ = T Jµ and Jµ̄ = T Jµ̄.

Since T has a unique fixed point (T is a contraction), Jµ = Jµ̄ = J*, i.e., both policies are optimal.
Alternatively, in the policy evaluation step one can solve the linear system

J^{µk} = g_{µk} + αP_{µk} J^{µk} = T_{µk} J^{µk},

where

g_{µk} = ( Σ_{j=1}^n P(1, µk(1), j) g(1, µk(1), j), · · · , Σ_{j=1}^n P(n, µk(n), j) g(n, µk(n), j) )^T

and

P_{µk} = [ P(i, µk(i), j) ]_{i,j=1}^n.
13.2 Recap of story
• Basics of RL
• Multi armed Bandits (single state with multiple actions)
1. Greedy strategy
2. ϵ−greedy strategy
3. UCB Exploration
4. Gradient based search
• Markov Decision Processes: here we assume knowledge of the system model, i.e., transition
probabilities, reward function, etc. In MDPs, we have covered:
1. Finite Horizon Problems (N < ∞ but deterministic)
2. Stochastic Shortest Path Problems(N < ∞ but random)
3. Discounted Cost Problems(N = ∞)
Algorithms covered:
1. Dynamic Programming Algorithm(Finite Horizon Problems)
2. Bellman Equation(Stochastic Shortest Path Problems and Discounted Cost Problems)
– Value Iteration
– Policy Iteration
The Monte Carlo method can also be written as an update rule. Let

Vn(s) = (1/n) Σ_{m=1}^n Gm,  n ≥ 1,

where Gm is the return of the m-th trajectory started at s0 = s. Then,

V_{n+1}(s) = (1/(n+1)) Σ_{m=1}^{n+1} Gm
= (n/(n+1)) · (1/n) Σ_{m=1}^n Gm + (1/(n+1)) G_{n+1}
= (n/(n+1)) Vn(s) + (1/(n+1)) G_{n+1}
= Vn(s) + (1/(n+1)) (G_{n+1} − Vn(s)).

In general, one may let

V_{n+1}(s) = Vn(s) + αn (G_{n+1} − Vn(s)),

where {αn}_{n≥0} are step sizes (learning rates) such that

Σ_n αn = ∞,  Σ_n αn² < ∞.

Note that, as n → ∞,

Vn(s) → Eµ[ G | s0 = s ] = Jµ(s).
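The incremental form with αn = 1/n reproduces the sample mean exactly, which can be sketched in a few lines; the returns below are made-up numbers.

```python
# Sketch: the incremental update V <- V + (1/n) (G - V) reproduces
# the running average of returns (the returns below are made up).
returns = [5.0, 3.0, 4.0, 8.0, 2.0]   # G_1, ..., G_5 observed for state s

V, n = 0.0, 0
for G in returns:
    n += 1
    V = V + (1.0 / n) * (G - V)       # alpha_n = 1/n recovers the sample mean

print(V)   # approximately 4.4, the mean of the five returns
```

Unlike storing all returns, this uses O(1) memory per state, which is the point of writing Monte Carlo as an update rule.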
Now rewrite the return as a telescoping sum. Recall that

V_{n+1}(sn) = Vn(sn) + αn (G_{n+1} − Vn(sn))
= Vn(sn) + αn ( Σ_{j=1}^{N−n} R_{n+j} − Vn(sn) )
= Vn(sn) + αn Σ_{j=1}^{N−n} ( R_{n+j} + Vn(s_{n+j}) − Vn(s_{n+j−1}) ),

where N is the termination time and Vn(s_N) = 0 at the terminal state. Let

d_j = R_{n+j} + Vn(s_{n+j}) − Vn(s_{n+j−1}).

These quantities are referred to as temporal difference terms, or temporal errors. Then,

V_{n+1}(sn) = Vn(sn) + αn Σ_{j=1}^{N−n} d_j.
14 Lecture 14: Temporal Difference Learning (Shalabh Bhatna-
gar)
This line of work was started by Richard Sutton in 1984 in his PhD thesis.
Recall that the Monte Carlo scheme tries to solve for

V(s) = Eµ[ G | s0 = s ] = Jµ(s).

The problem is that we do not know this expectation, and we resort to solving for it via the TD
recursion

V_{n+1}(sn) = Vn(sn) + αn ( R_{n+1} + Vn(s_{n+1}) − Vn(sn) ),

with Vn unchanged at all other states. The step sizes satisfy

αn > 0 ∀ n,
Σ_n αn = ∞ (which ensures t(n) = Σ_{m=0}^n αm → ∞ as n → ∞),
Σ_n αn² < ∞.
Plot a graph of Vn versus t(n); all these points are discrete. Draw a line between these points. One
way to approximate the resulting trajectory is to consider an ODE and look at its asymptotic
behaviour. One can show that the relevant ODE is

V̇(t) = D h(V(t)),

where D = diag(γ(1), γ(2), · · · , γ(n)) is a diagonal matrix of the relative update frequencies of
the states, and

h(V)(i) = Σ_{j=1}^n P_ij(π(i)) ( Rπ(i, j) + V(j) − V(i) ),  i = 1, · · · , n.
Since the value of l is arbitrary, we can form a weighted average of all such l-step Bellman equations.
Let λ ∈ [0, 1). Since Σ_{l=0}^∞ (1 − λ)λ^l = 1, we can write the following Bellman equation:

Vπ(i_k) = (1 − λ) Σ_{l=0}^∞ λ^l Eπ[ Σ_{m=0}^l r(i_{k+m}, i_{k+m+1}) + Vπ(i_{k+l+1}) ]
= (1 − λ) Eπ[ Σ_{l=0}^∞ λ^l Σ_{m=0}^l r(i_{k+m}, i_{k+m+1}) ] + (1 − λ) Eπ[ Σ_{l=0}^∞ λ^l Vπ(i_{k+l+1}) ]
= (1 − λ) Eπ[ Σ_{m=0}^∞ Σ_{l=m}^∞ λ^l r(i_{k+m}, i_{k+m+1}) ] + Eπ[ Σ_{l=0}^∞ (λ^l − λ^{l+1}) Vπ(i_{k+l+1}) ]
= (1 − λ) Eπ[ Σ_{m=0}^∞ r(i_{k+m}, i_{k+m+1}) Σ_{l=m}^∞ λ^l ] + Eπ[ (1 − λ)Vπ(i_{k+1}) + (λ − λ²)Vπ(i_{k+2}) + · · · ]
= Eπ[ Σ_{m=0}^∞ λ^m r(i_{k+m}, i_{k+m+1}) ] + Eπ[ Vπ(i_{k+1}) − Vπ(i_k) + λ(Vπ(i_{k+2}) − Vπ(i_{k+1})) + · · · ] + Vπ(i_k)
= Eπ[ Σ_{m=0}^∞ λ^m r(i_{k+m}, i_{k+m+1}) ] + Eπ[ Σ_{m=0}^∞ λ^m (Vπ(i_{k+m+1}) − Vπ(i_{k+m})) ] + Vπ(i_k)
= Eπ[ Σ_{m=0}^∞ λ^m ( r(i_{k+m}, i_{k+m+1}) + Vπ(i_{k+m+1}) − Vπ(i_{k+m}) ) ] + Vπ(i_k).

Letting d_m = r(i_m, i_{m+1}) + Vπ(i_{m+1}) − Vπ(i_m) (the temporal difference terms), this reads

Vπ(i_k) = Eπ[ Σ_{m=0}^∞ λ^m d_{k+m} ] + Vπ(i_k),

i.e.,

0 = Eπ[ Σ_{m=0}^∞ λ^m d_{k+m} ].

Indeed, each individual term satisfies Eπ[d_m] = 0, by the one-step Bellman equation for Vπ.
15 Lecture 15: Q-Learning (Shalabh Bhatnagar)
Suppose we have access to simulated transitions j ∼ p_{i,·}(u) for every i ∈ S and u ∈ A(i). Then the
Q-Learning algorithm is

Q_{m+1}(i, u) = Q_m(i, u) + γ_m ( g(i, u, j) + min_{v∈A(j)} Q_m(j, v) − Q_m(i, u) ),

where j ∼ p_{i,·}(u), and the step sizes γ_m are selected such that Σ_m γ_m = ∞ and Σ_m γ_m² < ∞.
Proposition. Consider the algorithm

Q_{t+1}(i, u) = Q_t(i, u) + γ_t(i, u) ( g(i, u, ī) + min_{v∈A(ī)} Q_t(ī, v) − Q_t(i, u) ),

where ī ∼ p_{i,·}(u). Let Q_t(0, u) = 0 for all u ∈ A(0). Let T^{i,u} denote the set of all times at which
Q(i, u) is updated, with γ_t(i, u) = 0 for all t ∉ T^{i,u}, Σ_t γ_t(i, u) = ∞ and Σ_t γ_t²(i, u) < ∞. Then

Q_t(i, u) → Q*(i, u) a.s. as t → ∞, for all i ∈ S and u ∈ A(i),

in both of the following cases:
(i) All policies are proper.
(ii) Assumptions (A) and (B) hold.
Proof sketch. Rewrite the iteration as

Q_{t+1}(i, u) = (1 − γ_t(i, u)) Q_t(i, u) + γ_t(i, u) ( (HQ_t)(i, u) + ω_t(i, u) ),

where (HQ)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i, u, j) + min_{v∈A(j)} Q(j, v) ) and the noise is

ω_t(i, u) = g(i, u, ī) + min_{v∈A(ī)} Q_t(ī, v) − Σ_{j=1}^n p_ij(u) ( g(i, u, j) + min_{v∈A(j)} Q_t(j, v) ).
Observe that E[ωt(i, u) | Ft] = 0 for all i ∈ S and u ∈ A(i). Furthermore, ∃ a constant k > 0 such
that

E[ ωt(i, u)² | Ft ] ≤ k ( 1 + max_{j∈S, v∈A(j)} Qt(j, v)² ).

Then, assumption (B) holds. Suppose now that all policies are proper. Then we have shown that
∃ ξ(i) > 0 for all i ≠ 0 and β ∈ [0, 1) such that

Σ_{j=1}^n p_ij(u) ξ(j) ≤ β ξ(i)  ∀ i ≠ 0, u ∈ A(i).

Let

Q = (Q(i, u), i ∈ S, u ∈ A(i))^T,

and define the weighted max norm ||Q||_ξ = max_{i∈S, u∈A(i)} |Q(i, u)|/ξ(i). Consider two vectors Q and Q̄.
Then,

|(HQ)(i, u) − (HQ̄)(i, u)| ≤ Σ_{j=1}^n p_ij(u) | min_{v∈A(j)} Q(j, v) − min_{v∈A(j)} Q̄(j, v) |
≤ Σ_{j=1}^n p_ij(u) max_{v∈A(j)} |Q(j, v) − Q̄(j, v)|
= Σ_{j=1}^n p_ij(u) max_{v∈A(j)} ( |Q(j, v) − Q̄(j, v)| / ξ(j) ) · ξ(j)
≤ Σ_{j=1}^n p_ij(u) ||Q − Q̄||_ξ · ξ(j)
≤ β ξ(i) ||Q − Q̄||_ξ  (as Σ_{j=1}^n p_ij(u) ξ(j) ≤ β ξ(i)).

Dividing by ξ(i) and taking the maximum over (i, u), ||HQ − HQ̄||_ξ ≤ β ||Q − Q̄||_ξ. Therefore, H is
a weighted max-norm pseudo-contraction, and the result follows from the general result.
We now justify the second inequality above, i.e., that |min_v Q(j, v) − min_v Q̄(j, v)| ≤ max_v |Q(j, v) − Q̄(j, v)|.
For functions f, g on a set A,

inf_{x∈A} f(x) − inf_{x∈A} g(x) ≤ sup_{x∈A} (f(x) − g(x)) ≤ sup_{x∈A} |f(x) − g(x)|,

and similarly inf_{x∈A} g(x) − inf_{x∈A} f(x) ≤ sup_{x∈A} |f(x) − g(x)|. The claim follows.
Suppose state St is visited at time t. Then the Q-Learning algorithm in the online setting is

Q_{t+1}(St, At) = Q_t(St, At) + γ_t(St, At) ( g(St, At, S_{t+1}) + min_{v∈A(S_{t+1})} Q_t(S_{t+1}, v) − Q_t(St, At) ).

The question is how to select At in the update rule: it should combine exploitation of the current
estimate with exploration of the other actions. One possibility is the ε-greedy rule:

At = argmin_{v∈A(St)} Q_t(St, v)  with probability 1 − ε,
At = randomly selected from A(St)  with probability ε.

The corresponding on-policy algorithm, SARSA, replaces the minimum over v by the value of the action
actually chosen at time t + 1:

Q_{t+1}(St, At) = Q_t(St, At) + γ_t(St, At) ( g(St, At, S_{t+1}) + Q_t(S_{t+1}, A_{t+1}) − Q_t(St, At) ),

where both At and A_{t+1} are chosen by the ε-greedy rule:

At = argmin_{v∈A(St)} Q_t(St, v) w.p. 1 − ε, random from A(St) w.p. ε,
A_{t+1} = argmin_{v∈A(S_{t+1})} Q_t(S_{t+1}, v) w.p. 1 − ε, random from A(S_{t+1}) w.p. ε.

Q-Learning is called an off-policy algorithm and SARSA an on-policy algorithm. Off-policy
algorithms are more popular in practice. Refer to the book by Sutton and Barto for more
details on Double Q-Learning, Expected SARSA, etc.
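Online Q-learning with ε-greedy selection can be sketched as follows. The 2-state MDP, its transition probabilities and costs are made up for illustration; since the notes minimize costs, the greedy action is an argmin, and a discount factor α is used as in the discounted-cost setting. The step sizes γ_t(i, u) = 1/n(i, u)^0.6 (per state-action visit count) satisfy the conditions above.

```python
# Sketch of online Q-learning with epsilon-greedy action selection on
# a made-up 2-state, 2-action MDP with costs (we minimize, as in the
# notes, so the greedy action is an argmin).
import random

random.seed(0)
S, A = [0, 1], [0, 1]
P = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],     # P[(s, a)] = next-state dist.
     (1, 0): [0.9, 0.1], (1, 1): [0.1, 0.9]}
g = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 3.0}   # one-stage costs
alpha, eps = 0.9, 0.1                                       # discount, exploration

Q = {(s, a): 0.0 for s in S for a in A}
counts = {(s, a): 0 for s in S for a in A}
s = 0
for _ in range(50000):
    if random.random() < eps:
        a = random.choice(A)                     # explore
    else:
        a = min(A, key=lambda u: Q[(s, u)])      # exploit (argmin: costs)
    s2 = random.choices(S, weights=P[(s, a)])[0]
    counts[(s, a)] += 1
    gamma_t = 1.0 / counts[(s, a)] ** 0.6        # per-pair step size
    Q[(s, a)] += gamma_t * (g[(s, a)] + alpha * min(Q[(s2, u)] for u in A)
                            - Q[(s, a)])
    s = s2

print({k: round(v, 2) for k, v in Q.items()})
```

For this instance the optimal policy takes action 1 in state 0 and action 0 in state 1, and the learned greedy policy matches it.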
Professor Gugan will cover function approximation methods and the basics of stochastic approximation
algorithms. Whatever he teaches will involve Lipschitz continuity (as in policy gradient
methods), but he will largely cover function approximation methods.
16 Lecture 16
Scribe: Rohit
(Notes for this lecture are yet to be written.)
17 Lecture 17: Application of Stochastic Approximation to RL
Given a policy µ, our goal is to approximate Jµ, given by

Jµ(s) = E[ Σ_{t=0}^∞ γ^t r(st, at) | s0 = s ],

which satisfies the Bellman equation

Jµ(s) = Σ_{a,s′} µ(a|s) P(s′|s, a) ( r(s, a) + γJµ(s′) ).

In this setting, to find Jµ we have to solve this system of |S| linear equations in |S|
unknowns (assuming we know the probabilities P and the policy µ).
In the model-free setup, we do not know P. We try to exploit laws of large numbers here: by the
SLLN, for i.i.d. samples X1, X2, . . . ,

(X1 + X2 + . . . + Xn)/n → E[X] a.s.

Jµ is the expectation of an infinite sum; how do we get samples of it? We call one sample one run
until termination,

(s0, a0, r(s0, a0), s1, a1, r(s1, a1), . . . , sT),

from which we calculate Σ_{t=0}^{T−1} γ^t r(st, at) + γ^T r(sT).
For each s, we collect k such samples with s0 = s and call them C1(s), C2(s), . . . , Ck(s). Then

Jµ(s) ≈ ( C1(s) + C2(s) + . . . + Ck(s) ) / k.
With this naive approach, both space and time grow linearly with n. The approach is non-incremental,
i.e., you do not reuse your samples. Instead, write the sample mean

x̄_n = (X1 + X2 + . . . + Xn)/n

recursively:

x̄_n = ( (n − 1) x̄_{n−1} + Xn ) / n = x̄_{n−1} + (1/n)(Xn − x̄_{n−1}).

More generally,

x̄_n = x̄_{n−1} + αn (Xn − x̄_{n−1}).

To interpret this, consider f(x) = ½ E[(x − X)²], whose gradient is

∇f(x) = x − E[X].

Let us do a gradient descent: x_n = x_{n−1} − αn ∇f(x_{n−1}) = x_{n−1} + αn (E[X] − x_{n−1}). Replacing E[X]
by the fresh sample Xn, we recover the recursion above, so what we got before is stochastic gradient
descent. So we can write:

x̄_n = x̄_{n−1} + αn (Xn − x̄_{n−1}),
Ĵµ^n(s) = Ĵµ^{n−1}(s) + αn ( Cn(s) − Ĵµ^{n−1}(s) ).
17.1 TD Algorithm
Now consider the objective

f(x) = ½ ||Jµ − x||²_D = ½ Σ_s d(s) [Jµ(s) − x(s)]²,

where d is a distribution over states. Gradient descent gives

x_{n+1} = x_n + αn (−∇f(x_n)) = x_n + αn Σ_s d(s) [Jµ(s) − x_n(s)] e_s,

and expanding Jµ(s) via the Bellman equation,

x_{n+1} = x_n + αn Σ_{s,a,s′} d(s) µ(a|s) P(s′|s, a) [ r(s, a) + γJµ(s′) − x_n(s) ] e_s.

We do not know Jµ(s′) (an infinite sum), so we substitute x_n(s′) for it. But now this can no longer
be viewed as the gradient of the earlier objective function. We can view d(s) µ(a|s) P(s′|s, a) as the
joint distribution of (s, a, s′) and replace the sum by a single sample:

x_{n+1} = x_n + αn [ r(s_n, a_n) + γ x_n(s′_n) − x_n(s_n) ] e_{s_n},

where

s_n ∼ d,  a_n ∼ µ(·|s_n),  s′_n ∼ P(·|s_n, a_n).

This is the TD(0) algorithm. Assume we can sample from d; then the scheme is model-free: we do
not know P, but we only need to be able to sample from it (and from d).
If instead we let a Markov chain evolve under µ, its state distribution approaches the stationary
distribution d, and s_{n+1} itself plays the role of s′_n. The algorithm we obtain (TD(0) with Markov
sampling) is slightly different from the previous one:

V_{n+1}(s_n) = V_n(s_n) + αn [ r(s_n, a_n) + γ V_n(s_{n+1}) − V_n(s_n) ],

with all other components unchanged. This too cannot be expressed as the gradient of any function,
so such schemes are studied under stochastic approximation algorithms.
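TD(0) with Markov sampling can be sketched as follows. The 2-state chain, its transition probabilities and rewards are made up for illustration (the fixed policy is folded into P and r), and the per-state step size 1/n(s)^0.6 is one admissible choice of the step-size conditions above.

```python
# Sketch of tabular TD(0) with Markov sampling on a made-up 2-state
# chain, with the fixed policy folded into P and r.
import random

random.seed(1)
gamma = 0.9
P = {0: [0.5, 0.5], 1: [0.7, 0.3]}   # P(s'|s) under the fixed policy
r = {0: 1.0, 1: 0.0}                 # expected reward r(s) under the policy

# The true J_mu solves J(s) = r(s) + gamma * sum_s' P(s'|s) J(s'):
# here J(0) = 6.19..., J(1) = 5.34...

V = {0: 0.0, 1: 0.0}
counts = {0: 0, 1: 0}
s = 0
for _ in range(200000):
    s2 = random.choices([0, 1], weights=P[s])[0]
    counts[s] += 1
    alpha = 1.0 / counts[s] ** 0.6
    V[s] += alpha * (r[s] + gamma * V[s2] - V[s])   # TD(0) update
    s = s2

print({k: round(v, 2) for k, v in V.items()})   # close to the true J_mu
```

Note that each update touches only the visited state, and no transition model is ever used — only sampled transitions.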
18 Lecture 3: Temporal Difference Learning and Function Approximation
18.1 Markov Decision Processes (MDP)
To describe an MDP, we need the tuple

(S, A, P, r, γ).

Under a fixed policy µ, the MDP reduces to a Markov chain (MC), described by the pair

(S, Pµ).

The transition probabilities differ from those of the MDP: we denote the MC transition probability
by

Pµ(s′|s) = Σ_a µ(a|s) P(s′|s, a).

If we start at some arbitrary state s0 and allow the Markov chain to evolve, after n steps we reach
sn. If the Markov chain is well-behaved, its state distribution converges to a stationary distribution dµ,
independent of n, satisfying

dµ^T Pµ = dµ^T.

When |S| is large, we use linear function approximation: fix a feature matrix

Φ ∈ R^{S×d},  d ≪ S,

and restrict attention to vectors x ∈ col(Φ).
New goal: Find θ* such that

Jµ ≈ Φθ*.
Consider f(θ) = ½ Σ_s dµ(s) ( Jµ(s) − Φ(s)^T θ )², where Φ(s)^T denotes the s-th row of Φ. Gradient
descent update:

θ_{n+1} = θn + αn [−∇f(θn)].

Update rule:

θ_{n+1} = θn + αn Σ_s dµ(s) ( Jµ(s) − Φ(s)^T θn ) Φ(s).

Expanding Jµ(s) by the Bellman equation and substituting the current approximation Φ(s′)^T θn for
Jµ(s′), we get

θ_{n+1} = θn + αn Σ_{s,a,s′} dµ(s) µ(a|s) P(s′|s, a) ( r(s, a) + γ Φ(s′)^T θn − Φ(s)^T θn ) Φ(s),

and the corresponding sampled update

θ_{n+1} = θn + αn ( r(sn, an) + γ Φ(s′n)^T θn − Φ(sn)^T θn ) Φ(sn),

where

sn ∼ dµ(·),  an ∼ µ(·|sn),  s′n ∼ P(·|sn, an).

In tabular TD(0), the whole operation happens in an |S|-dimensional space, whereas here it happens
in a d-dimensional space. This reduces the time complexity.
θ0, θ1, . . . , θn ∈ Fn, which means they are measurable with respect to the σ-field Fn (the history of
the algorithm up to step n).
Now, let

δn = r(sn, an) + γ Φ(s′n)^T θn − Φ(sn)^T θn.

We define

h(θn) = E[ δn Φ(sn) | Fn ].

Expanding this expectation (we make the assumption that sn is independent of the conditioning
term, as sn is sampled fresh), we finally get

h(θ) = b − Aθ,  b = Φ^T Dµ rµ,  A = Φ^T Dµ (I − γPµ) Φ,

where Dµ = diag(dµ) and rµ is the vector of expected one-stage rewards under µ; indeed

Aθ = Φ^T Dµ Φθ − γ Φ^T Dµ Pµ Φθ.

Writing the noise as M_{n+1} = δn Φ(sn) − h(θn), we have the condition

E[ M_{n+1} | Fn ] = 0,

so

θ_{n+1} = θn + αn [ b − Aθn + M_{n+1} ].

This can be viewed as a noisy Euler approximation of the ODE

θ̇(t) = h(θ(t)) = b − Aθ(t).
19 Lecture 19 (Gugan Thoppe)
Recall that we assumed that the feature matrix Φ is given to us. We want to minimize

f(θ) = ½ ||Jµ − Φθ||²_{Dµ}.

We came up with the update rule

θ_{n+1} = θn + αn ( r(sn, an) + γ Φ(s′n)^T θn − Φ(sn)^T θn ) Φ(sn).
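The update rule can be sketched with Markov sampling. The 3-state chain, rewards, and feature matrix below are made up for illustration, and the step size αn = 1/n^0.6 is one admissible choice of p ∈ (0.5, 1] from the step-size conditions discussed in this lecture.

```python
# Sketch of TD(0) with linear function approximation (made-up 3-state
# chain, 2 features; rewards depend only on the current state).
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4],
              [0.5, 0.4, 0.1]])           # P_mu(s'|s) under the fixed policy
r = np.array([1.0, 0.0, 2.0])             # expected rewards r_mu(s)
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])              # feature matrix, full column rank

theta = np.zeros(2)
s = 0
for n in range(1, 200001):
    s2 = rng.choice(3, p=P[s])            # Markov sampling: s' becomes next s
    delta = r[s] + gamma * Phi[s2] @ theta - Phi[s] @ theta
    theta += (1.0 / n ** 0.6) * delta * Phi[s]   # update in d dimensions
    s = s2

# Compare with the TD fixed point theta*' = A^{-1} b:
d = np.linalg.matrix_power(P, 100)[0]     # stationary distribution (approx.)
D = np.diag(d)
A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
b = Phi.T @ D @ r
print(theta, np.linalg.solve(A, b))
```

The iterate approaches A^{-1} b, the fixed point analyzed in the rest of this lecture, rather than the minimizer of f itself.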
19.1 Is θ*′ asymptotically stable?
19.1.1 Lyapunov functions
Let V : R^d → R be given by

V(θ) = ½ ||θ − θ*′||²,

where θ*′ = A^{−1} b is the equilibrium of the ODE θ̇(t) = h(θ(t)) (assuming A is invertible).
If ∇V(θ)^T h(θ) < 0 for all θ ≠ θ*′, then

dV(θ(t))/dt = ∇V(θ(t))^T θ′(t) = ∇V(θ(t))^T h(θ(t)) < 0

for all θ(t) ≠ θ*′. In this case any trajectory starting from θ(0) ≠ θ*′ will converge to θ*′. The function
V is called a Lyapunov function. Now let us verify this condition:

∇V(θ) = θ − θ*′,
∇V(θ)^T h(θ) = (θ − θ*′)^T (b − Aθ) = −(θ − θ*′)^T A (θ − θ*′)  (using b = Aθ*′).
If we can show that A is positive definite then it is invertible and the above expression is negative.
Lemma. θ^T Aθ > 0 for all θ ∈ R^d \ {0}.

θ^T Aθ = θ^T Φ^T Dµ (I − γPµ) Φ θ = y^T Dµ (I − γPµ) y,  where y = Φθ.

We will assume henceforth that Φ has full column rank, so y ≠ 0 whenever θ ≠ 0. Then it suffices to
show that B = Dµ(I − γPµ) is positive definite, i.e., for all y ≠ 0,

y^T Dµ y − γ y^T Dµ Pµ y > 0.

Using y^T Dµ Pµ y ≤ y^T Dµ y (proved below),

y^T Dµ y − γ y^T Dµ Pµ y ≥ y^T Dµ y − γ y^T Dµ y = (1 − γ) y^T Dµ y > 0.

We further assume that the stationary distribution dµ is componentwise positive, so that y^T Dµ y > 0
for all y ≠ 0; this holds if the chain is irreducible.
It remains to prove that y^T Dµ Pµ y ≤ y^T Dµ y = ||y||²_{Dµ}.
Proof. By the Cauchy–Schwarz inequality,

y^T Dµ Pµ y = (Dµ^{1/2} y)^T (Dµ^{1/2} Pµ y) ≤ ||Dµ^{1/2} y|| · ||Dµ^{1/2} Pµ y|| = ||y||_{Dµ} · ||Pµ y||_{Dµ}.

Next we show that ||Pµ y||_{Dµ} ≤ ||y||_{Dµ}, i.e., Pµ is non-expansive in the Dµ norm. The left-hand side
squared is

||Pµ y||²_{Dµ} = Σ_s Dµ(s) ( Σ_{s′} Pµ(s′|s) y(s′) )².

Recall Jensen's inequality: for any convex function f, f(E[X]) ≤ E[f(X)]. With f(x) = x²,

( Σ_{s′} Pµ(s′|s) y(s′) )² ≤ Σ_{s′} Pµ(s′|s) y(s′)².

Hence

||Pµ y||²_{Dµ} ≤ Σ_s Dµ(s) Σ_{s′} Pµ(s′|s) y(s′)² = Σ_{s′} dµ(s′) y(s′)² = ||y||²_{Dµ},

where we used stationarity, Σ_s dµ(s) Pµ(s′|s) = dµ(s′). Combining the two displays completes
the proof.
One would expect that the noisy algorithm would behave in a similar manner, but before that:
should we be excited about θ*′?
Lemma. Recall θ*′ = A^{−1} b, where A = Φ^T Dµ (I − γPµ) Φ and b = Φ^T Dµ rµ. Then
Φθ*′ satisfies the projected Bellman equation and is the fixed point of the projected Bellman
operator ΠTµ, where Π = Φ (Φ^T Dµ Φ)^{−1} Φ^T Dµ.
The geometric interpretation is that Π projects any J onto the column space of Φ in the Dµ norm:
verify that

½ ||J − ΠJ||²_{Dµ} = min_θ ½ ||J − Φθ||²_{Dµ}.

The projected Bellman operator ΠTµ is a contraction in the Dµ norm, and Φθ*′ is its unique fixed
point.
20 Lecture 20 (Gugan Thoppe)
Recall

f(θ) = ½ ||Jµ − Φθ||²_{Dµ} = ½ Σ_s Dµ(s) ( Jµ(s) − Φ(s)^T θ )².

The gradient of f is

∇f(θ) = − Σ_s Dµ(s) ( Jµ(s) − Φ(s)^T θ ) Φ(s).

The Hessian is

∇²f(θ) = Σ_s Dµ(s) Φ(s) Φ(s)^T = Φ^T Dµ Φ ≻ 0

(positive definite, since Φ has full column rank and dµ > 0). Thus the optimal θ* is given by ∇f(θ*) = 0;
it is the unique minimizer of f, given by

θ* = (Φ^T Dµ Φ)^{−1} Φ^T Dµ Jµ.

Thus,

Φθ* = Φ (Φ^T Dµ Φ)^{−1} Φ^T Dµ Jµ = ΠJµ,

the closest point in the column space of Φ to Jµ in the Dµ norm, where Π = Φ(Φ^T Dµ Φ)^{−1} Φ^T Dµ is
the projection operator.
Φθ*′, on the other hand, is the fixed point of the equation ΠTµ Φθ = Φθ. Now we wish to compare
||Jµ − Φθ*′||_{Dµ} with ||Jµ − Φθ*||_{Dµ} = ||Jµ − ΠJµ||_{Dµ}. By the triangle inequality,

||Jµ − Φθ*′||_{Dµ} ≤ ||Jµ − ΠJµ||_{Dµ} + ||ΠJµ − Φθ*′||_{Dµ}.

Now, using Jµ = Tµ Jµ, Φθ*′ = ΠTµ Φθ*′, and the fact that ΠTµ is a γ-contraction in the Dµ norm
(Π is non-expansive and Tµ is a γ-contraction in the Dµ norm, since Pµ is non-expansive in the Dµ
norm as shown in the previous lecture),

||ΠJµ − Φθ*′||_{Dµ} = ||ΠTµ Jµ − ΠTµ Φθ*′||_{Dµ} ≤ γ ||Jµ − Φθ*′||_{Dµ}.

Therefore, (1 − γ) ||Jµ − Φθ*′||_{Dµ} ≤ ||Jµ − ΠJµ||_{Dµ}. Hence,

||Jµ − Φθ*′||_{Dµ} ≤ (1/(1 − γ)) ||Jµ − ΠJµ||_{Dµ}.

If γ is close to 1, the gap between ||Jµ − Φθ*′||_{Dµ} and the best achievable error ||Jµ − ΠJµ||_{Dµ} can
be very large.
Claim. The sequence (θn)_{n≥0} generated using the noisy algorithm converges almost surely to θ*′.
Proof. We will verify the four assumptions of the result proved by Michel Benaïm in 1996 (second
chapter of Borkar).
(A1) Let h(x) = b − Ax. Then h is Lipschitz continuous with constant L = ||A||, as

||h(x) − h(y)|| = ||A(x − y)|| ≤ ||A|| · ||x − y||.

(A2) Choose step sizes αn such that Σ_n αn = ∞ and Σ_n αn² < ∞, such as αn = 1/n^p where
p ∈ (0.5, 1]. In general, step sizes are kept constant for some time and then decreased to zero
to obtain a faster convergence rate. If the step size is kept constant, the noise term will not go to
zero; if it is decreased too fast, the rate of convergence will be very slow. The choice of
step size is a trade-off between these two and is domain specific.
(A3) (Mn)_{n≥1} is a square-integrable martingale difference sequence with respect to the filtration
(Fn)_{n≥1}, i.e.,

E||Mn||² < ∞,  E[M_{n+1} | Fn] = 0.

Furthermore, ∃ K > 0 such that

E[ ||M_{n+1}||² | Fn ] ≤ K ( 1 + ||θn||² ).

The property E[M_{n+1} | Fn] = 0 follows since M_{n+1} = δn Φ(sn) − (b − Aθn) and b − Aθn =
E[δn Φ(sn) | Fn]. We now verify the bound. We have

||M_{n+1}|| ≤ ||δn Φ(sn)|| + ||b − Aθn||.

For the first term, δn Φ(sn) = ( r(sn, an) + γ Φ(s′n)^T θn − Φ(sn)^T θn ) Φ(sn). If all rewards are
bounded, i.e., |r(sn, an)| ≤ Rmax, and the features are bounded by one, ||Φ(s)|| ≤ 1 (wlog, as
we can always normalize the feature space), then

||δn Φ(sn)|| ≤ Rmax + (1 + γ) ||θn||.

Similarly, ||b|| ≤ Rmax and ||Aθn|| ≤ ||A|| · ||θn||. Therefore,

||M_{n+1}|| ≤ 2Rmax + (1 + γ + ||A||) ||θn||,

and squaring gives E[ ||M_{n+1}||² | Fn ] ≤ K (1 + ||θn||²) for a suitable constant K.
We will show (A4) and part (a) of (A3) in the next lecture.