
RL Class Notes

The document contains lecture notes for a Reinforcement Learning course taught by Pratyush Kant and Sahil Chaudhary in Spring 2025, detailing administrative information, grading structure, and lecture topics. Key concepts include the types of learning (supervised, unsupervised, reinforcement), the structure of reinforcement learning problems, and examples such as communication networks. The course is structured with lectures on various topics, including Markov Decision Processes, dynamic programming, and algorithms like Q-learning.


Reinforcement Learning Lecture Notes

E1 277
Spring 2025
Pratyush Kant & Sahil Chaudhary

Administrative Details
• Saturdays, 9:30-11:00 AM; this slot may sometimes be used for extra classes or tutorials.
• References:
– Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto
– Neuro-Dynamic Programming by Dimitri Bertsekas and John Tsitsiklis
– Optimal Control and Dynamic Programming by Dimitri Bertsekas
– Reinforcement Learning and Optimal Control by Dimitri Bertsekas
• Teams code: 17jggiq
• The first half will be taken by Shalabh Bhatnagar and the second half by Gugan Thoppe.
• Grading: 50% sessionals, 50% finals. Shalabh will conduct a quiz for 5 marks and a midterm
for 20 marks. For the finals, there will be a course project for 20 marks and a final exam for
30 marks.
• Four TAs: Kaustubh, Naman, Ankur and Prashana.
• Shalabh’s midterm will be on 15th February.
• Shalabh’s quiz towards the end of January.
• Quiz 1 on Saturday, 1st February.

Contents
1 Lecture 1 (Shalabh Bhatnagar) 4
1.1 Classes of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Example of Communication Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Exploration and Exploitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Example of Tic-Tac-Toe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Lecture 2 (Shalabh Bhatnagar, Sutton chapter 1 and 2) 7


2.1 Sutton Chapter 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Sutton Chapter 2: Multi-armed Bandits . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Lecture 3 (Shalabh Bhatnagar, Sutton chapter 2) 10


3.1 A small digression on stochastic approximation algorithms . . . . . . . . . . . . . . . 10
3.2 Back to multi-armed bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Upper Confidence Bound (UCB) Algorithm . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 A small digression on UCB algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5 Gradient Bandit Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Finite Horizon Problems (Bertsekas Volume 1, Chapter 1, Shalabh Bhatnagar) 14
4.1 Markov Decision Processes (MDPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Finite Horizon Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3.1 Principle of Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.4 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5 Finite Horizon Problems (Shalabh Bhatnagar) 17


5.1 Dynamic Programming Example: Chess Match . . . . . . . . . . . . . . . . . . . . . 17
5.2 Dynamic Programming Example: Control of a Queue . . . . . . . . . . . . . . . . . 18

6 Stochastic Shortest Path Problem (Shalabh Bhatnagar) 19


6.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

7 Lecture 7: Stochastic Shortest Path Problems 22

8 Lecture 8: Stochastic Shortest Path (Shalabh Bhatnagar) 26

9 Lecture 9: Stochastic Shortest Path (Shalabh Bhatnagar) 29


9.1 Numerical Schemes for MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
9.1.1 Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

10 Lecture 10: Policy Iteration (Shalabh Bhatnagar) 33


10.1 Gauss-Seidel Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
10.2 Modified Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
10.3 Multi Stage Look ahead Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 35

11 Lecture 11: Infinite Horizon Discounted Problems (Shalabh Bhatnagar) 36

12 Lecture 12: Infinite Horizon Discounted Problems (Shalabh Bhatnagar) 40


12.1 Value Iteration and Error Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

13 Lecture 13: Online Lecture (Shalabh Bhatnagar) 44


13.1 Policy Iteration Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
13.2 Recap of story . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
13.3 New Story . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
13.4 Monte Carlo Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

14 Lecture 14: Temporal Difference Learning (Shalabh Bhatnagar) 49


14.1 Key Idea in TD algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
14.2 Analysis of such system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
14.3 TD(λ) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

15 Lecture 15: Q Learning (Shalabh Bhatnagar) 52

16 Lecture 16 56

17 Lecture 17: Application Of Stochastic Approximation To RL 57


17.1 TD Algo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

18 Lecture 18: Temporal Difference Learning and Function Approximation 60
18.1 Markov Decision Processes (MDP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
18.2 Temporal Difference (TD) Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
18.2.1 Linear Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 60
18.2.2 Objective Function and Update Rule . . . . . . . . . . . . . . . . . . . . . . . 60
18.3 TD(0) with Linear Function Approximation . . . . . . . . . . . . . . . . . . . . . . . 61

19 Lecture 19 (Gugan Thoppe) 64



19.1 Is θ∗ asymptotically stable? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
19.1.1 Lyapunov functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

20 Lecture 20 (Gugan Thoppe) 67

1 Lecture 1 (Shalabh Bhatnagar)
Learning theory is categorized into three types: supervised learning, unsupervised learning and
reinforcement learning. In supervised learning, we have a dataset of input-output pairs, and we
try to learn a function that maps inputs to outputs. In unsupervised learning, we have a dataset
of inputs, and we try to learn the underlying structure of the data. In reinforcement learning, we
have an agent that interacts with an environment and tries to learn a policy that maximizes the
cumulative reward.
There is an agent, an environment and states. States describe the key features of the environment.
The agent is a decision-making entity that interacts with the environment. In the beginning, the
environment is in state S0. The agent takes action A0, and the environment transitions probabilistically
to a new state S1 and gives a reward R1 to the agent. The agent looks at the new state and
reward and takes another action A1 . The environment again jumps probabilistically to a new state
S2 and gives a reward R2 to the agent. This process continues. The goal of the agent is to select a
sequence of actions depending on the states of the environment so as to maximize the “long-term
reward”.1
The probabilistic transition between states is given by

p_t(S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a)  ∀ s, s′ ∈ S, a ∈ A, r ∈ R.

If the probabilities are stationary, the subscript t can be dropped. We will also discretize time,
though it can be continuous in some cases as well.

1.1 Classes of Problems


States only describe the environment, while actions are decided by the agent. Suppose N is the
number of stages of decision making. Then the sequence generated by the agent-environment
interaction is
S_0, A_0, R_1, S_1, A_1, R_2, S_2, . . . , S_{N−1}, A_{N−1}, R_N, S_N.

The types of problems are based on the values N takes:

• N < ∞:
  – N < ∞ and N is a deterministic number: This is called the finite horizon problem.
  – N < ∞ and N is a random variable: This is called the episodic or stochastic shortest
    path problem.
  The long-term reward is given by

  E[ Σ_{t=0}^{N−1} R_t | S_0 = s ].

• N = ∞:
  – Discounted rewards: The long-term reward is given by

    lim_{N→∞} E[ Σ_{t=0}^{N} γ^t R_{t+1} | S_0 = s ],

    where γ ∈ (0, 1) is the discount factor. This has connections to economic theory and is
    a good model if the value of future rewards is less than the value of immediate rewards.
  – Long-term average reward: The long-term reward is given by

    lim_{N→∞} (1/N) E[ Σ_{t=1}^{N} R_t | S_0 = s ].

    This is a good model if the value of future rewards remains the same.

It can be shown that

lim_{γ→1} ( (1 − γ) lim_{N→∞} E[ Σ_{t=0}^{N} γ^t R_{t+1} | S_0 = s ] − lim_{N→∞} (1/N) E[ Σ_{t=1}^{N} R_t | S_0 = s ] ) = 0.

¹ Transition dynamics are usually stationary but can be non-stationary as well (as in traffic).

The key ingredients of reinforcement learning problem are:


1. State (Environment)
2. Action (Agent)
3. Reward (Environment gives reward to agent)

1.2 Example of Communication Network


Suppose there is a communication network with routers, connections and two computers S and D,
source and destination. There are four routers, R1 , R2 , R3 and R4 arranged as a tetrahedron. R1
is connected to R2 and R3 which in turn are connected to R4 . R4 is connected to D. The source S
is connected to R1 . The goal is to send a packet from S to D through the routers. The objective is
to minimize the time taken to send the packet from S to D. Path P1 is S → R1 → R2 → R4 → D
and path P2 is S → R1 → R3 → R4 → D. Every router has some buffer.
The state of the system is N_i, the number of packets in the buffer of router R_i. The state
is the vector (N_1, N_2, N_3, N_4)^T. Every packet passes through R_1 and R_4, but they are still in the
state space, as one can decide whether or not to send a packet.¹ The action set is {P_1, P_2}.²
The reward in this case is − Σ_{i=1}^{4} N_i, or 1 / ( Σ_{i=1}^{4} N_i + 1 ), and the goal is
to minimize the queue lengths. There are several algorithms to solve this problem, such as Monte
Carlo Schemes (Trajectory Sampling), Temporal Difference Methods (making better decisions at
every time step), etc.

¹ N_4 is not needed.
² As long as packets are there, R_4 will deliver them, so R_4 is always empty.

1.3 Exploration and Exploitation
The trade-off between exploration and exploitation is a fundamental problem in learning theory. Exploration
is the process of trying out new things to learn more about the environment. Exploitation
is the process of using the knowledge gained so far to maximize the reward. If the agent exploits too
much, it might miss out on better actions; if it explores too much, it might not accumulate enough
reward. The agent therefore has to explore enough to learn about the environment and exploit
enough to maximize the reward.
The way around this, which most methods use, is to select the learnt action with a high probability
and, with a low probability, select a random action that has not been selected so far.

1.4 Example of Tic-Tac-Toe


Player 1, using X, is our RL agent, and Player 2 is a professional player using O. The state space
is the set of board configurations, each cell S_i being empty or filled with X or O. The action
is placing an X in an empty cell. Another variable P is also needed in the state to
describe the player whose turn it is, or who has played the last move. The reward can be 0 in
between moves, 1 if Player 1 wins, 1/2 if it is a draw and 0 if Player 1 loses. These rewards are called
sparse rewards because they are given only at the end of the game.

2 Lecture 2 (Shalabh Bhatnagar, Sutton chapter 1 and 2)
2.1 Sutton Chapter 1
The basic setting is an agent that interacts with an environment and learns through this interaction.
In practice, there can be more than one agent, but in this course we will consider only one agent.
The agent looks at the state of the environment and takes an action. The environment transitions
to a new state (probabilistically) and gives a reward (probabilistically) to the agent. The agent
learns from the reward and the new state and takes another action. This process continues. The
goal of the agent is to learn a policy that maximizes the long-term reward. The state, action and
reward sequence generated is

S0 , A0 , R1 , S1 , A1 , R2 , S2 , . . . , SN −1 , AN −1 , RN , SN .

The goal of the agent is to select a sequence of actions in response to the states of the environment
so as to maximize the “long-term reward”. The long-term rewards depend on the short-term
rewards. To average out the randomness in the environment, we use the expectation of the rewards.
The long-term reward is called the value function.
A policy is a decision rule: in a given state, it prescribes an action to be chosen. Policies can be
deterministic or stochastic. For instance, suppose the number of states is 2 (s_1, s_2) and the number
of actions is 3 (a_1, a_2, a_3).
• Deterministic policy: π(s1 ) = a1 , π(s2 ) = a3 .
• Stochastic policy:
– π(s1 , a1 ) = 0.7, π(s1 , a2 ) = 0.3, π(s1 , a3 ) = 0.
– π(s2 , a1 ) = 0.2, π(s2 , a2 ) = 0.3, π(s2 , a3 ) = 0.5.
• Optimal policy: The policy that maximizes the long-term reward.
The objective is to find a policy π which maximizes the value function.
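As a toy illustration (not from the notes), the two kinds of policies from the example above can be represented and sampled as follows; the state/action names match the example:

```python
import random

# Deterministic policy: state -> action (the example pi(s1)=a1, pi(s2)=a3).
det_policy = {"s1": "a1", "s2": "a3"}
# Stochastic policy: state -> distribution over actions.
stoch_policy = {"s1": {"a1": 0.7, "a2": 0.3, "a3": 0.0},
                "s2": {"a1": 0.2, "a2": 0.3, "a3": 0.5}}

def act(policy, state, rng=random):
    """Return an action: directly for a deterministic policy, sampled otherwise."""
    rule = policy[state]
    if isinstance(rule, str):
        return rule
    actions, probs = zip(*rule.items())
    return rng.choices(actions, weights=probs)[0]
```

An action with probability 0 (here a3 in state s1) is never sampled.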
There are two parts to an RL problem:
1. Prediction: Given a policy π, estimate the value, V_π, of the policy.
2. Control: Find the optimal policy. The control problem can only be solved after solving the
prediction problem.
We will also assume Markovian structure, i.e., the future depends only on the present state and
not on the past states:

P(St+1 = s′ | St = s, At = a, St−1 , At−1 , . . . , S0 , A0 ) = P(St+1 = s′ | St = s, At = a).

The agent also remembers the entire history of the interaction. We will now start chapter 2 of
Sutton’s book.

2.2 Sutton Chapter 2: Multi-armed Bandits


The term bandit comes from casinos where there are slot machines with levers. The agent has to
decide which lever to pull to maximize the reward.

We will consider a model with a single slot machine with K arms. There is a single state, and
the objective is to decide which arms to pull, in what order, and how many times to pull each arm.
Each time an arm is pulled, a reward is generated randomly based on the probability distribution
of the arm. We will also assume there is no correlation between the rewards of the arms; hence the
sequence becomes irrelevant.
Define
q∗(a) := E[R_t | A_t = a],  a ∈ {1, 2, . . . , K}.
The goal is to find a∗ = arg max_{a ∈ {1,2,...,K}} q∗(a). The agent does not know q∗(a) and has to estimate
it. The agent has no information about the distribution of the rewards.
Define

Q_n(a) := ( Σ_{i=1}^{n} R_i · I(A_{i−1} = a) ) / ( Σ_{i=1}^{n} I(A_{i−1} = a) )  ∀ a ∈ {1, 2, . . . , K}.
The expression is the average reward of arm a after n pulls serving as an estimate of q ∗ (a) at time
n. The possible strategies are:
• Greedy strategy: Pull the arm

  a = arg max_{a ∈ {1,2,...,K}} Q_n(a).

  Not a great strategy, as it does not allow for exploration.

• ε-greedy strategy: Pull the arm

  a = { arg max_{a ∈ {1,2,...,K}} Q_n(a)   with probability 1 − ε,
      { a random arm                       with probability ε.
Read section 2.3 of Sutton’s book for more details.


Suppose K = 10 with reward distributions N (µi , 1) where µi is the mean of the distribution of arm
i. The goal is to find the arm with the maximum mean.
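A minimal simulation of the ε-greedy strategy on such a Gaussian testbed can be sketched as follows (the arm means, ε and the number of steps are illustrative choices, not from the notes):

```python
import random

def epsilon_greedy_bandit(means, steps=5000, eps=0.1, seed=0):
    """Run epsilon-greedy with sample-average estimates Q_n(a)."""
    rng = random.Random(seed)
    K = len(means)
    Q = [0.0] * K   # estimates of q*(a)
    N = [0] * K     # pull counts
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(K)                   # explore: random arm
        else:
            a = max(range(K), key=lambda i: Q[i])  # exploit: greedy arm
        r = rng.gauss(means[a], 1.0)               # reward ~ N(mu_a, 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental sample average
    return Q, N

Q, N = epsilon_greedy_bandit([0.0, 0.5, 1.0, 1.5, -0.5])
# The best arm (mean 1.5) should end up pulled far more often than the worst.
```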
Suppose we decide to always select an action a. Then,

Q_n(a) = (1/n) Σ_{i=1}^{n} R_{i+1}.
It needs huge amounts of storage to store all the rewards. A more efficient way is to use the
recursive formula:

Q_{n+1}(a) = ( Σ_{i=1}^{n+1} R_{i+1} ) / (n + 1)
           = (1/(n+1)) ( Σ_{i=1}^{n} R_{i+1} + R_{n+2} )
           = (1/(n+1)) ( n Q_n(a) + R_{n+2} )
           = Q_n(a) + (1/(n+1)) ( R_{n+2} − Q_n(a) ).

This saves storage and is called the incremental update rule, as we don’t need to store all the
rewards: a reward is seen, used to update the estimate of the mean of the arm, and then discarded.

Q_n(a) → q∗(a) almost surely as n → ∞, i.e., P( lim_{n→∞} Q_n(a) = q∗(a) ) = 1.

Instead, if we use the update rule with a constant α ∈ (0, 1):

Q_{n+1}(a) = Q_n(a) + α ( R_{n+1} − Q_n(a) )
           = α R_{n+1} + (1 − α) Q_n(a)
           = α R_{n+1} + (1 − α) ( α R_n + (1 − α) Q_{n−1}(a) )
           = α R_{n+1} + (1 − α) α R_n + (1 − α)² Q_{n−1}(a)
           = . . .
           = α Σ_{i=1}^{n+1} (1 − α)^{n+1−i} R_i + (1 − α)^{n+1} Q_0(a).

This is called an exponential recency-weighted average, which gives more weight to recent
rewards. Verify that the sum of all the weights is 1. This class of algorithms is called
fading memory algorithms. These are typically used in non-stationary environments where the
dynamics change over time.
Q_0(a) is the initial estimate of the mean of the arm, which is typically set to 0 if no information is
available. Otherwise, if some information is available, it can be set to an estimate of the mean reward.
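The claim that the weights sum to 1, i.e. α Σ_{i=1}^{n+1} (1 − α)^{n+1−i} + (1 − α)^{n+1} = 1, can be checked numerically (a quick sketch; the values of α and n are arbitrary):

```python
def recency_weights(alpha, n):
    """Weights on R_1..R_{n+1} and on Q_0 in the exponential recency-weighted average."""
    w = [alpha * (1 - alpha) ** (n + 1 - i) for i in range(1, n + 2)]
    w_q0 = (1 - alpha) ** (n + 1)
    return w, w_q0

w, w0 = recency_weights(alpha=0.3, n=10)
total = sum(w) + w0
# total equals 1 up to floating-point error; later rewards carry larger weights.
```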

3 Lecture 3 (Shalabh Bhatnagar, Sutton chapter 2)
In the case of multi-armed bandits, if the ri ’s are deterministic then there is no need for exploration.
But in the case of stochastic rewards, exploration is needed.
SLLN: Let X_1, X_2, . . . be a sequence of i.i.d. random variables with E[X_i] = µ < ∞. Then,

(1/n) Σ_{i=1}^{n} X_i → µ almost surely as n → ∞, i.e., P( lim_{n→∞} (1/n) Σ_{i=1}^{n} X_i = µ ) = 1.

Let r_i^a denote the reward of arm a at time i. Then by the SLLN,

(1/n) Σ_{i=1}^{n} r_i^a → q∗(a) almost surely as n → ∞.

We have also seen the iterative update rule:

Q_{n+1}(a) = Q_n(a) + (1/(n+1)) ( r_{n+1}^a − Q_n(a) ).

Denote α_n := 1/(n+1). Then,

Q_{n+1}(a) = Q_n(a) + α_n ( r_{n+1}^a − Q_n(a) ).

Instead of α_n = 1/(n+1), we can also use an arbitrary positive sequence α_n. One also has to ensure
convergence, which is guaranteed by the Robbins-Monro conditions (which act as a
generalization of the SLLN), given by:

Σ_{t=1}^{∞} α_t = ∞  and  Σ_{t=1}^{∞} α_t² < ∞.

Thus sequences like 1/(t+1), 1/((t+1) log(t+1)) and log(t+1)/(t+1), where t ≥ 1, are valid. These algorithms are called
stochastic approximation algorithms.

3.1 A small digression on stochastic approximation algorithms


These stochastic approximation algorithms started in the 1950s and 1960s. The first algorithm
was by Robbins and Monro in 1951 (appeared in Annals of Mathematical Statistics). The first
application was by Kiefer and Wolfowitz in 1952. The first proof of convergence was obtained by
Robbins and Siegmund in 1971. The first application in RL was by Watkins in 1989.
Robbins and Monro, in 1951, solved the root-finding problem. Let f : R^d → R^d be a function. The
goal is to find x∗ ∈ R^d such that f(x∗) = 0. The issue is that f is not known. However, one can sample
noisy versions of f at different points, f(x) + Ψ, where Ψ is a noise term. Suppose we are able to
get as many noisy samples as we want. The algorithm is as follows:

x_{t+1} = x_t + α_t ( f(x_t) + Ψ_t ),

where α_t is a positive sequence and f(x_t) + Ψ_t is the noisy sample of f at time t. One can provably
argue that, under some conditions, starting from an arbitrary x_0, x_t → x∗ as t → ∞, where
f(x∗) = 0. What they showed was that the sequence of x_t’s converges to the root of f in the mean
square sense, i.e.,

E[ ∥x_t − x∗∥² ] → 0 as t → ∞.

Subsequently, it was shown that the sequence of x_t’s converges to the root of f almost surely, i.e.,

P( lim_{t→∞} x_t = x∗ ) = 1.

The constraints on α_t are the same as the Robbins-Monro conditions.


Some of the applications of stochastic approximation algorithms are:
• Fixed point of F : R^d → R^d, i.e., a zero of the function F − I, where I is the identity map.
• (Local) minimum of a function F : R^d → R. We can set f = −∇F.
These algorithms are significant as they don’t need a model of the environment, i.e., they are model-free.
They are also simple to implement and computationally efficient. They are robust to noise
and can be used in non-stationary environments, in areas such as signal processing, optimization, traffic
control, etc.
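A minimal sketch of the Robbins-Monro iteration on a toy root-finding problem (the target f(x) = 2 − x with root x∗ = 2 and the Gaussian noise are our illustrative choices, not from the notes):

```python
import random

def robbins_monro(noisy_f, x0, steps=20000, seed=0):
    """x_{t+1} = x_t + a_t (f(x_t) + noise), with a_t = 1/(t+1)."""
    rng = random.Random(seed)
    x = x0
    for t in range(steps):
        alpha = 1.0 / (t + 1)          # satisfies the Robbins-Monro conditions
        x = x + alpha * noisy_f(x, rng)
    return x

# f(x) = 2 - x observed through additive N(0, 1) noise.
root = robbins_monro(lambda x, rng: (2.0 - x) + rng.gauss(0.0, 1.0), x0=10.0)
# root is close to the true zero x* = 2 despite never seeing f exactly.
```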

3.2 Back to multi-armed bandits


Recall our algorithm:

Q_{t+1}(a) = Q_t(a) + α_t ( R_{t+1}^a − Q_t(a) )
           = Q_t(a) + α_t ( E[R_{t+1}^a | A_t = a] + ( R_{t+1}^a − E[R_{t+1}^a | A_t = a] ) − Q_t(a) ).

The noise is R_{t+1}^a − E[R_{t+1}^a | A_t = a] and the function is f(Q_t(a)) = E[R_{t+1}^a | A_t = a] − Q_t(a);
hence the algorithm converges to Q∗(a) = E[R_{t+1}^a | A_t = a] (which is q∗(a) defined previously)
under the Robbins-Monro conditions.

3.3 Upper Confidence Bound (UCB) Algorithm


The problem with the ε-greedy algorithm is that it does not take into account the uncertainty in the
estimates. For instance, if Q_n(a) = 0 and n = 1, then the algorithm will select the arm with
probability 1/K even though the uncertainty is high.
The UCB algorithm tries to take the uncertainty in the estimates into account. The algorithm is
as follows:

A_t = arg max_{a ∈ {1,2,...,K}} ( Q_t(a) + c √( log(t) / N_t(a) ) ),

where c > 0 is a constant and N_t(a) is the number of times arm a has been pulled till time t. The
term √( log(t) / N_t(a) ) is the uncertainty in the estimate of the mean of the arm. The term log(t) is used
to keep the uncertainty from vanishing too quickly as t increases. The constant c controls the trade-off
between exploration and exploitation and is called the exploration parameter.
The term Q_t(a) is the exploitation term and √( log(t) / N_t(a) ) is the exploration term. The algorithm
tries to balance between exploration and exploitation.
Initially, at t = 0, N_t(a) = 0 for all a; an arm that has never been pulled is treated as having infinite
uncertainty. Once we select an action a, N_t(a) = 1 and its uncertainty term is no longer infinite,
unlike the arms not yet pulled, so each arm gets pulled early on.
As t increases, N_t(a) increases as well, but log t grows at a much slower rate and is practically
a constant. Hence, the exploration term decreases as t increases and eventually dies out, leaving the
algorithm to use the exploitation term only. In summary, the algorithm initially explores and, as t
increases, it exploits.
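A sketch of the UCB rule on a small Gaussian bandit (arm means and the constant c are illustrative choices; arms with N_t(a) = 0 are pulled first, mimicking the infinite-uncertainty convention above):

```python
import math
import random

def ucb(means, steps=3000, c=2.0, seed=0):
    """Select A_t = argmax_a Q_t(a) + c * sqrt(log t / N_t(a))."""
    rng = random.Random(seed)
    K = len(means)
    Q = [0.0] * K
    N = [0] * K
    for t in range(1, steps + 1):
        if 0 in N:
            a = N.index(0)   # never-pulled arm: infinite uncertainty, pull it first
        else:
            a = max(range(K), key=lambda i: Q[i] + c * math.sqrt(math.log(t) / N[i]))
        r = rng.gauss(means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
    return Q, N

Q, N = ucb([0.2, 0.9, 0.4])
# Suboptimal arms are pulled only O(log t) times; the best arm dominates.
```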

3.4 A small digression on UCB algorithm


Suppose R_1, R_2, . . . are independent and sub-Gaussian, i.e., E[R_i] = 0 and E[e^{λR_i}] ≤ e^{λ²σ²/2} for all
λ ∈ R and some σ > 0. Let Q_n(i) := (1/n) Σ_{j=1}^{n} R_j be the sample mean of the rewards. Then, by
Hoeffding’s inequality (taking σ = 1),

P( Q_n(i) ≥ t ) ≤ e^{−nt²/2}.

Let δ = e^{−nt²/2}. Then t = √( 2 log(1/δ) / n ). For δ = 1/n, t = √( 2 log n / n ). Therefore, a good
estimate of the mean reward is

Q_n(i) + √( 2 log n / N_n(i) ).

Hence, it is better than the ε-greedy algorithm, since ε-greedy takes every non-greedy action uniformly
at random, no matter how many times it has been taken.

3.5 Gradient Bandit Algorithms


The UCB algorithm is good when the rewards are bounded. But if the rewards are unbounded,
then the UCB algorithm does not work. The gradient bandit algorithm is a good alternative to the
UCB algorithm when the rewards are unbounded. Unlike other algorithms discussed, the gradient
bandit algorithm is a policy-based algorithm and not a value-based algorithm (doesn’t look into
the estimates of the rewards).
Define H_t(a) as a preference for arm a at time t. The Gibbs (or Boltzmann) distribution is given by

P(A_t = a) = π_t(a) = e^{H_t(a)} / Σ_{b=1}^{K} e^{H_t(b)}.

In order to update the preferences, we use the following gradient update rule:

H_{t+1}(a) = H_t(a) + α_t ∂E[R_t]/∂H_t(a),

where E[R_t] = Σ_x π_t(x) q∗(x) and q∗(x) = E[R_t | A_t = x].

∂E[R_t]/∂H_t(a) = ∂/∂H_t(a) ( Σ_x π_t(x) q∗(x) )
                = Σ_x q∗(x) ∂π_t(x)/∂H_t(a)
                = Σ_x ( q∗(x) − β_t ) · ∂π_t(x)/∂H_t(a).

A good choice of β_t brings down the variance of the updates, improving the convergence of the
algorithm while keeping the expression unbiased, since β_t is independent of x and thus

Σ_x β_t ∂π_t(x)/∂H_t(a) = β_t Σ_x ∂π_t(x)/∂H_t(a)
                        = β_t ∂/∂H_t(a) ( Σ_x π_t(x) )
                        = β_t ∂(1)/∂H_t(a) = 0.
∂Ht (a)
Recall π_t(x) = e^{H_t(x)} / Σ_y e^{H_t(y)}. Then,

∂π_t(x)/∂H_t(a) = ∂/∂H_t(a) ( e^{H_t(x)} / Σ_y e^{H_t(y)} )
                = ( 1_{x=a} e^{H_t(x)} Σ_y e^{H_t(y)} − e^{H_t(x)} e^{H_t(a)} ) / ( Σ_y e^{H_t(y)} )²
                = 1_{x=a} π_t(x) − π_t(x) π_t(a)
                = π_t(x) ( 1_{x=a} − π_t(a) ).

Hence, the partial derivative becomes:

∂E[R_t]/∂H_t(a) = Σ_x ( q∗(x) − β_t ) · π_t(x) ( 1_{x=a} − π_t(a) ).

We select β_t = R̄_t, where R̄_t is the average reward up to time t. Thus, the update rule becomes:

H_{t+1}(a) = H_t(a) + α_t Σ_x ( q∗(x) − R̄_t ) · π_t(x) ( 1_{x=a} − π_t(a) ).

Observe that

Σ_x ( q∗(x) − R̄_t ) · π_t(x) ( 1_{x=a} − π_t(a) ) = E[ ( q∗(A_t) − R̄_t ) ( 1_{A_t=a} − π_t(a) ) ].

Use E[X] = E[ E[X | Y] ] to get:

E[ ( q∗(A_t) − R̄_t ) ( 1_{A_t=a} − π_t(a) ) ] = E[ E[ ( q∗(A_t) − R̄_t ) ( 1_{A_t=a} − π_t(a) ) | A_t ] ]
                                              = E[ ( R_t − R̄_t ) ( 1_{A_t=a} − π_t(a) ) ]  (since E[R_t | A_t] = q∗(A_t)).

Dropping the expectation (a stochastic approximation step), we get:

H_{t+1}(a) = H_t(a) + α_t ( R_t − R̄_t ) ( 1_{A_t=a} − π_t(a) ).

Running the algorithm for a long time, H_t(a) → H∗(a) as t → ∞, where H∗(a) is the preference
for arm a at convergence. This scheme is called softmax action selection.
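The final update rule can be sketched as follows (arm means, step size and horizon are illustrative choices, not from the notes):

```python
import math
import random

def gradient_bandit(means, steps=5000, alpha=0.1, seed=0):
    """Softmax action selection with preference updates
    H_{t+1}(a) = H_t(a) + alpha * (R_t - Rbar_t) * (1{A_t=a} - pi_t(a))."""
    rng = random.Random(seed)
    K = len(means)
    H = [0.0] * K
    rbar = 0.0
    for t in range(1, steps + 1):
        z = [math.exp(h) for h in H]
        s = sum(z)
        pi = [v / s for v in z]                       # Gibbs/Boltzmann distribution
        a = rng.choices(range(K), weights=pi)[0]
        r = rng.gauss(means[a], 1.0)
        rbar += (r - rbar) / t                        # running-average baseline Rbar_t
        for b in range(K):
            H[b] += alpha * (r - rbar) * ((1.0 if b == a else 0.0) - pi[b])
    return H

H = gradient_bandit([0.0, 1.0, 0.3])
```

Since Σ_b (1_{b=a} − π_t(b)) = 0, each update leaves the sum of the preferences unchanged; only their differences, and hence the softmax probabilities, evolve.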

4 Finite Horizon Problems (Bertsekas Volume 1, Chapter 1, Shalabh Bhatnagar)
4.1 Markov Decision Processes (MDPs)
MDPs assume that we know the system model. However, finding the optimal policy is still a difficult
problem. The key assumption is a controlled Markov chain. We have a state space denoted by S
and an action space denoted by A. Given a state s, A(s) denotes the set of feasible actions in state
s. Further,

A = ∪_{s∈S} A(s).

Let {X_n} be a sequence of random variables defined on a common probability space (Ω, F, P).
It depends on a control-valued sequence {Z_n} such that Z_n ∈ A(X_n). The sequence {X_n} is a
controlled Markov chain if for all n ≥ 0 and all s ∈ S,

P(X_{n+1} = s′ | X_n = s, Z_n = a, X_{n−1} = s_{n−1}, Z_{n−1} = a_{n−1}, . . . , X_0 = s_0, Z_0 = a_0)
  = P(X_{n+1} = s′ | X_n = s, Z_n = a).

This is very similar to the definition of Markov chains. The quantity P(X_{n+1} = s′ | X_n = s, Z_n = a)
is denoted by p(s, a, s′); these are called the transition probabilities. Some properties are:
1. p(s, a, s′) ≥ 0.
2. Σ_{s′∈S} p(s, a, s′) = 1.

A Markov Decision Process (MDP) is a controlled Markov chain with a cost structure, a cost
associated with every transition, denoted by g(in , an , in+1 ). The cost is a function of the current
state, the action taken and the next state. Here, Xn = in , Xn+1 = in+1 and Zn = an . Clearly,
in , in+1 ∈ S and an ∈ A(in ).
For now, we will assume that the state space, action space and feasible actions are known along
with the transition probabilities.

4.2 Decision Making


Horizon of Decision Making: N denotes the number of instants of time needed for decision making.
We classify these problems as:
1. N < ∞: Finite horizon problems.
2. N < ∞ but is random: Stochastic shortest path or episodic problems.
3. N = ∞: Infinite horizon problems.

4.3 Finite Horizon Problems


A policy is a decision rule specified as

π = {µ0 , µ1 , . . . , µN −1 }.

At time N the process terminates (N is called the terminal instant). For each k ∈ [N − 1], µk : S → A
is a function such that µk (s) ∈ A(s) for all s ∈ S.

µ0 is used to take actions at time 0, µ1 at time 1 and so on. The collection of these functions is
called as a policy. The objective is to find an optimal policy, π ∗ , that minimizes the cost over the
horizon N .
For each x_0 ∈ S and each policy π, define J_π(x_0) as

J_π(x_0) = E[ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, µ_k(x_k), x_{k+1}) | X_0 = x_0 ].

The expectation is taken over the joint distribution of the random variables X0 , X1 , . . . , XN under
the policy π.
The objective is to find a policy π ∗ such that Jπ∗ (x0 ) ≤ Jπ (x0 ) for all x0 ∈ S and all policies π.
Here gk (xk , µk (xk ), xk+1 ) is the cost incurred at time k when the state is xk , action is ak = µk (xk )
and the next state is xk+1 at any instant k ∈ [N −1]. gN (xN ) is the terminal cost when the terminal
state is xN .1 In a finite horizon problem, it is very difficult to find a policy independent of time.
The optimal policy is time-dependent most of the time.
Let Π denote the set of all policies. The optimal policy is denoted by π ∗ ∈ Π such that Jπ∗ (x0 ) ≤
Jπ (x0 ) for all x0 ∈ S and all π ∈ Π. The optimal cost is denoted by J ∗ (x0 ) = Jπ∗ (x0 ) =
minπ∈Π Jπ (x0 ). Observe that π ∗ is independent of x0 . One can also have multiple optimal policies.

4.3.1 Principle of Optimality


Let π∗ = {µ∗_0, µ∗_1, . . . , µ∗_{N−1}} be an optimal policy. Assume that, when following the policy π∗, the
system reaches state x_i at time i with positive probability. Consider the following subproblem:
starting at state x_i at time i, with the objective

min_{π^i = {µ_i, µ_{i+1}, . . . , µ_{N−1}}} E[ g_N(x_N) + Σ_{k=i}^{N−1} g_k(x_k, µ_k(x_k), x_{k+1}) | X_i = x_i ].

The principle of optimality states that the optimal policy for this subproblem is

(π^i)∗ = {µ∗_i, µ∗_{i+1}, . . . , µ∗_{N−1}}.

That is, the tail of the optimal policy for the original problem is optimal for the subproblem. This
is the principle of optimality, and it is a necessary condition for optimality: if it were not true, the
subproblem would admit a strictly better policy, which could be substituted into the original policy
to improve it, contradicting the optimality of π∗.
Hence, the optimal policy is independent of the history of the system.

4.4 Dynamic Programming


Dynamic programming is a method to solve finite horizon problems. The method is based on the
principle of optimality.

¹ Instead of functions µ_k, one can model the problem as finding a probability distribution over actions at each
time instant; each µ_i then becomes a probability distribution.

Proposition: For every initial state x_0 ∈ S, the optimal cost J∗(x_0) equals J_0(x_0), the value
obtained at the last step of the following algorithm:

J_N(x_N) = g_N(x_N)  ∀ x_N ∈ S,
J_k(x_k) = min_{a_k ∈ A(x_k)} E_{X_{k+1}}[ g_k(x_k, a_k, x_{k+1}) + J_{k+1}(x_{k+1}) ]  ∀ k ∈ [N − 1] ∪ {0}, x_k ∈ S.

The second equation means that the optimal cost-to-go at time k and state x_k is the minimum,
over actions a_k, of the expected sum of the immediate cost of taking action a_k in state x_k and the
optimal cost-to-go from time k + 1 at the next state x_{k+1}.

Proof. For any admissible π = {µ_0, µ_1, . . . , µ_{N−1}}, let π^k = {µ_k, µ_{k+1}, . . . , µ_{N−1}}. Denote

J_k∗(x_k) = min_{π^k} E_{(X_{k+1},...,X_N)}[ Σ_{i=k}^{N−1} g_i(x_i, µ_i(x_i), x_{i+1}) + g_N(x_N) | X_k = x_k ].

J_k∗(x_k) is the optimal cost for the (N − k)-stage subproblem. Let J_N∗(x_N) = g_N(x_N) = J_N(x_N) for
all x_N ∈ S. We will show using induction that J_k∗(x_k) = J_k(x_k) for all k ∈ [N − 1] ∪ {0} and
all x_k ∈ S. Assume that for some k and all x_{k+1}, we have J_{k+1}∗(x_{k+1}) = J_{k+1}(x_{k+1}). Note that
π^k = {µ_k, π^{k+1}}. Then, ∀ x_k ∈ S,

J_k∗(x_k) = min_{µ_k, π^{k+1}} E_{(X_{k+1},...,X_N)}[ Σ_{i=k}^{N−1} g_i(x_i, µ_i(x_i), x_{i+1}) + g_N(x_N) | X_k = x_k ].

The expectation can be split as:

J_k∗(x_k) = min_{µ_k, π^{k+1}} E_{X_{k+1}}[ g_k(x_k, µ_k(x_k), x_{k+1}) + E_{(X_i)_{i=k+2}^{N}}[ Σ_{i=k+1}^{N−1} g_i(x_i, µ_i(x_i), x_{i+1}) + g_N(x_N) | X_{k+1} = x_{k+1} ] | X_k = x_k ].

Since µ_k does not affect the inner term, this can be further simplified as:

J_k∗(x_k) = min_{µ_k} E_{X_{k+1}}[ g_k(x_k, µ_k(x_k), x_{k+1}) + min_{π^{k+1}} E_{(X_i)_{i=k+2}^{N}}[ Σ_{i=k+1}^{N−1} g_i(x_i, µ_i(x_i), x_{i+1}) + g_N(x_N) | X_{k+1} = x_{k+1} ] | X_k = x_k ].

By definition, the inner minimized expectation is J_{k+1}∗(x_{k+1}). From the induction hypothesis, J_{k+1}∗(x_{k+1}) =
J_{k+1}(x_{k+1}). Hence, the above expression is:

J_k∗(x_k) = min_{µ_k} E_{X_{k+1}}[ g_k(x_k, µ_k(x_k), x_{k+1}) + J_{k+1}(x_{k+1}) | X_k = x_k ].

Recall µ_k : S → A and µ_k(x_k) ∈ A(x_k). Hence, the above expression reduces to:

J_k∗(x_k) = min_{a_k ∈ A(x_k)} E_{X_{k+1}}[ g_k(x_k, a_k, x_{k+1}) + J_{k+1}(x_{k+1}) | X_k = x_k ].

This is the definition of J_k(x_k). Hence, J_k∗(x_k) = J_k(x_k) for all k ∈ [N − 1] ∪ {0} and all x_k ∈ S.

5 Finite Horizon Problems (Shalabh Bhatnagar)
5.1 Dynamic Programming Example: Chess Match
Consider an example of a chess match between a player and an opponent. The goal is to formulate
an optimal policy from the viewpoint of the player. A player can select:
• Timid play: The player plays defensively and never wins. The draw has probability pd and
the loss has probability 1 − pd .
• Bold play: The player plays aggressively and never draws. The win has probability pw and
the loss has probability 1 − pw .
Once a player chooses a strategy, it then sticks to it. Further, pd > pw . The score assignment is as
follows:
• Win: 1.
• Draw: 0.5.
• Loss: 0.
We define the state as x = (points of the player) − (points of the opponent). (This is by design
a maximization problem.) We further assume that the intermediate rewards r_k(x_k, a, x_{k+1}) are 0
for all k ∈ [N−1]; only the terminal reward r_N(x_N) = J_N(x_N) appears. If the match ends in a tie,
it goes into sudden death, and the player who wins the next game wins the match.
The optimal reward-to-go at the k-th stage is denoted by J_k(x_k) and satisfies:

Jk (xk ) = max {pd Jk+1 (xk ) + (1 − pd )Jk+1 (xk − 1), pw Jk+1 (xk + 1) + (1 − pw )Jk+1 (xk − 1)} .

Here J_N(x_N) = 1 if x_N > 0, p_w if x_N = 0, and 0 if x_N < 0 (with the scores level at the end, the
match goes to sudden death, which the player wins with probability p_w by playing boldly). Assume
p_d > p_w. For k = N − 1, we have:

J_{N−1}(x_{N−1}) = max{p_d J_N(x_{N−1}) + (1 − p_d) J_N(x_{N−1} − 1), p_w J_N(x_{N−1} + 1) + (1 − p_w) J_N(x_{N−1} − 1)}.
We know the structure of J_N. We will consider the different cases:

• x_{N−1} > 1: in this case J_N(x_{N−1} + 1) = J_N(x_{N−1}) = J_N(x_{N−1} − 1) = 1. Both the terms are 1 and
hence either of the two strategies is optimal; J_{N−1}(x_{N−1}) = 1.
• x_{N−1} = 1: in this case, J_{N−1}(1) = max{p_d + (1 − p_d)p_w, p_w + (1 − p_w)p_w}. The first term minus
the second is (p_d − p_w)(1 − p_w) > 0, so the first term is greater and the first strategy, playing
timidly, is optimal.
• x_{N−1} = 0: in this case, J_{N−1}(0) = max{p_d p_w, p_w} = p_w. The optimal strategy is then bold.
• x_{N−1} = −1: in this case, J_{N−1}(−1) = max{0, p_w²} = p_w². The optimal strategy is then bold.
• x_{N−1} < −1: in this case all the J_N terms are 0, so both terms are 0 and either of the two
strategies is optimal; J_{N−1}(x_{N−1}) = 0.
Consider a situation where the scores are tied and two games remain. We need to find J_{N−2}(0).
We have:

J_{N−2}(0) = max{p_d J_{N−1}(0) + (1 − p_d) J_{N−1}(−1), p_w J_{N−1}(1) + (1 − p_w) J_{N−1}(−1)}
           = max{p_d p_w + (1 − p_d) p_w², p_w (p_d + (1 − p_d) p_w) + (1 − p_w) p_w²}.

The second term exceeds the first by p_w²(1 − p_w) > 0, so in this situation the optimal strategy is
to play boldly. If the player is ahead by one point, it is optimal to play timidly.
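The case analysis above can be checked numerically. The sketch below runs the backward recursion with the hypothetical values p_d = 0.9 and p_w = 0.45 (any values with p_d > p_w would do) and recovers J_{N−2}(0) = max{p_d p_w + (1 − p_d)p_w², p_w(p_d + (1 − p_d)p_w) + (1 − p_w)p_w²} = 0.536625, attained by bold play.

```python
# Numerical check of the chess-match recursion; p_d, p_w are assumed values.
pd, pw = 0.9, 0.45
N = 2                                    # two games remaining
# Terminal reward J_N over all reachable score differentials x,
# encoding the sudden-death rule at x = 0.
J = {x: (1.0 if x > 0 else pw if x == 0 else 0.0) for x in range(-N, N + 1)}
for k in range(N - 1, -1, -1):           # backward over stages N-1, ..., 0
    J = {x: max(pd * J[x] + (1 - pd) * J[x - 1],        # timid play
                pw * J[x + 1] + (1 - pw) * J[x - 1])    # bold play
         for x in range(-k, k + 1)}
value_tied = J[0]                        # J_{N-2}(0) with scores level
```

With these numbers the timid term is 0.42525 and the bold term is 0.536625, consistent with bold play being optimal when the scores are tied.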

5.2 Dynamic Programming Example: Control of a Queue


Once a packet goes out of the server, the next one joins. The buffer size is n (the total number of
packets that can be accommodated in the system); if more than n packets arrive, the additional ones
are dropped. Consider a finite horizon problem with N stages. Customer arrivals and likewise departures
happen at times 0, 1, ..., N − 1. At time N, the process ends. The system can serve only one customer
during a period, a period being the time between two consecutive arrival instants. A customer may take
multiple periods of service (if a customer arrives at time 1, it may or may not leave by any time n > 1).
Let p_m denote the probability of m arrivals in a period, and assume that the number of arrivals in a
period is independent of the number of arrivals in any other period. There are two types of service:
• Slow service: the cost is C_s and the customer leaves with probability q_s.
• Fast service: the cost is C_f and the customer leaves with probability q_f.
The same customer can be given different types of service in different periods. Let r(i) be the
holding cost, the cost of holding i customers in a period, and R(i) the terminal cost when i customers
remain at the termination instant N. The single stage cost is then defined as r(i) + C_s for slow
service and r(i) + C_f for fast service. The transition probabilities from the empty state are
p_{0j}(µ_f) = p_j = p_{0j}(µ_s) for j = 0, 1, ..., n − 1. In the case j = n, p_{0n}(µ_s) = p_{0n}(µ_f) = Σ_{j=n}^∞ p_j. For i > 0,


p_{ij}(µ_f) =
  0                                               if j < i − 1,
  q_f p_0                                         if j = i − 1,
  q_f p_{j−i+1} + (1 − q_f) p_{j−i}               if i − 1 < j < n − 1,
  q_f Σ_{m=n−i}^∞ p_m + (1 − q_f) p_{n−1−i}       if j = n − 1,
  (1 − q_f) Σ_{m=n−i}^∞ p_m                       if j = n.

The analogous equations hold for slow service with q_s replacing q_f. The dynamic programming
algorithm is as follows:

J_N(i) = R(i) ∀ i ∈ {0, 1, ..., n},
J_k(i) = min{ r(i) + C_s + Σ_{j=0}^n p_{ij}(µ_s) J_{k+1}(j), r(i) + C_f + Σ_{j=0}^n p_{ij}(µ_f) J_{k+1}(j) } ∀ k ∈ [N−1] ∪ {0}, i ∈ {0, 1, ..., n}.
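A quick consistency check on the fast-service transition probabilities: each row of (p_{ij}(µ_f)) must sum to 1. The sketch below assumes a buffer of n = 5, q_f = 0.7, and a geometric arrival distribution p_m = 0.5^{m+1}; all of these numbers are hypothetical.

```python
# Row-sum sanity check for the fast-service transition probabilities above.
n, qf = 5, 0.7
p = [0.5 * 0.5 ** m for m in range(200)]         # p_m (truncated series)

def tail(m0):
    """Approximates sum_{m >= m0} p_m."""
    return sum(p[m0:])

def pij_fast(i, j):
    """p_ij(mu_f) for a non-empty queue, 0 < i <= n - 1."""
    if j < i - 1:
        return 0.0
    if j == i - 1:
        return qf * p[0]                         # served & left, 0 arrivals
    if j < n - 1:                                # i - 1 < j < n - 1
        return qf * p[j - i + 1] + (1 - qf) * p[j - i]
    if j == n - 1:
        return qf * tail(n - i) + (1 - qf) * p[n - 1 - i]
    return (1 - qf) * tail(n - i)                # j == n (buffer overflow)

row_sums = [sum(pij_fast(i, j) for j in range(n + 1)) for i in range(1, n)]
```

Each row sums to 1 (up to the truncation of the arrival distribution), which is how one can verify that the "+" sign in the middle branch is the right one: the q_f part collects exactly the full arrival distribution, and so does the (1 − q_f) part.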

6 Stochastic Shortest Path Problem (Shalabh Bhatnagar)
So far, we have seen
• Basics of RL. (Ch 1 of Sutton and Barto)
• Multi-armed bandit problem. (Ch 2 of Sutton and Barto)
• Finite Horizon MDPs. (Ch 1 of Bertsekas Volume 1)
Today we will cover the Stochastic Shortest Path Problem. (Refer to Neuro-Dynamic Programming by
Bertsekas and Tsitsiklis, Reinforcement Learning and Optimal Control by Bertsekas, or Ch 2 of
Bertsekas Volume 1, which is much more detailed.)
Stochastic shortest path problems are characterized by a goal state or terminal state. The terminal
state is reached with probability 1, but the time at which it is reached is not known in advance.
Stochastic shortest path problems are also referred to as episodic problems.

6.1 Problem Formulation


• We assume that there is a cost-free termination state, denoted 0. The state is absorbing,
i.e., p_00(u) = 1 and g(0, u, 0) = 0 for all u ∈ A(0).
• We also assume no discounting, γ = 1.
The problem is to reach the terminal state with the minimum expected cost. Define Jµ(i) as

Jµ(i) := lim_{N→∞} E_µ [ Σ_{k=0}^{N−1} g(S_k, µ(S_k), S_{k+1}) | S_0 = i ].

We consider stationary policies, i.e., π = {µ, µ, ...}. It is convenient to call µ itself the policy
instead of π. A stationary policy can be shown to be optimal in stochastic shortest path problems
and even in infinite horizon problems. We give some definitions:
1. Proper policy: a stationary policy µ is proper if

Pµ = max_{i=1,2,...,n} P(S_n ≠ 0 | S_0 = i, µ) < 1,

where n is the number of non-terminal states. It says that from every initial state, the probability
of having reached the terminal state after n steps is positive.
2. Improper policy: a stationary policy µ is improper if

Pµ = max_{i=1,2,...,n} P(S_n ≠ 0 | S_0 = i, µ) = 1.

It says that from some initial state, the probability of reaching the terminal state within n steps
is 0. Basically, a policy that is not proper is improper.
As a remark, µ is proper ⇐⇒ in the Markov chain corresponding to µ, there is a path of positive
probability from every state to the terminal state. Recall the general definition of an MDP:

P(X_{n+1} = s′ | X_n = s, Z_n = a, X_j = s_j, Z_j = a_j, j ∈ [n−1]) = P(X_{n+1} = s′ | X_n = s, Z_n = a).

When governed by µ, it becomes

P(X_{n+1} = s′ | X_n = s, Z_n = µ(s), X_j = s_j, Z_j = µ(s_j), j ∈ [n−1]) = P(X_{n+1} = s′ | X_n = s, Z_n = µ(s)).

Denote

P(X_{n+1} = j | X_n = i, Z_n = µ(i)) = Pµ(i, j),

where Pµ is the transition probability matrix satisfying:
1. Pµ(i, j) ≥ 0.
2. Σ_j Pµ(i, j) = 1.
This gives a homogeneous Markov chain. If the policy is non-stationary, then the transition
probabilities depend on the time instant; they are denoted by Pµ_k(i, j), where k is the time
instant, and the Markov chain is a non-homogeneous Markov chain.
We ask ourselves: what is the probability that after 2n steps the system is not in the terminal
state?

P(S_{2n} ≠ 0 | S_0 = i, µ) = P(S_{2n} ≠ 0 | S_n ≠ 0, S_0 = i, µ) · P(S_n ≠ 0 | S_0 = i, µ)
                            + P(S_{2n} ≠ 0 | S_n = 0, S_0 = i, µ) · P(S_n = 0 | S_0 = i, µ).

Observe that P(S_{2n} ≠ 0 | S_n = 0, S_0 = i, µ) = 0, since the terminal state is absorbing, and
that P(S_n ≠ 0 | S_0 = i, µ) ≤ Pµ. Hence, the above expression becomes:

P(S_{2n} ≠ 0 | S_0 = i, µ) = P(S_{2n} ≠ 0 | S_n ≠ 0, S_0 = i, µ) · P(S_n ≠ 0 | S_0 = i, µ)
                           ≤ P(S_{2n} ≠ 0 | S_n ≠ 0, S_0 = i, µ) · Pµ.

Because of the Markov property (and time-homogeneity), P(S_{2n} ≠ 0 | S_n ≠ 0, S_0 = i, µ) ≤ Pµ as
well. Hence,

P(S_{2n} ≠ 0 | S_0 = i, µ) ≤ Pµ².

In general,

P(S_k ≠ 0 | S_0 = i, µ) ≤ Pµ^{⌊k/n⌋} ∀ i ∈ [n].

For a justification, let k ∈ (n, 2n). Then,

P(S_k ≠ 0 | S_0 = i, µ) = P(S_k ≠ 0 | S_n ≠ 0, S_0 = i, µ) · P(S_n ≠ 0 | S_0 = i, µ) ≤ 1 · Pµ = Pµ^{⌊k/n⌋}.

From the above inequality,

lim_{k→∞} P(S_k ≠ 0 | S_0 = i, µ) = 0.
Hence, the probability of reaching the terminal state converges to 1. Recall that

Jµ(i) = lim_{N→∞} E_µ [ Σ_{k=0}^{N−1} g(S_k, µ(S_k), S_{k+1}) | S_0 = i ].

If µ is a proper policy and |g| is bounded, say |g(i, a, j)| ≤ M for all i, j and a ∈ A(i), then

Jµ(i) = E_µ [ Σ_{k=0}^∞ g(S_k, µ(S_k), S_{k+1}) | S_0 = i ],

and

|Jµ(i)| ≤ Σ_{m=0}^∞ E_µ [ |g(S_m, µ(S_m), S_{m+1})| | S_0 = i ]
        = Σ_{m=0}^∞ Σ_j Σ_k p^m_{ij}(µ) · p_{jk}(µ(j)) · |g(j, µ(j), k)|,

where p^m_{ij}(µ) = P(S_m = j | S_0 = i, µ) denotes the m-step transition probability under µ.

Let ĝµ(j) := Σ_k p_{jk}(µ(j)) · |g(j, µ(j), k)|. Then,

|Jµ(i)| ≤ Σ_{m=0}^∞ Σ_j p^m_{ij}(µ) · ĝµ(j).

If j = 0, then ĝµ(j) = 0 (the terminal state is cost-free). Then,

|Jµ(i)| ≤ Σ_{m=0}^∞ Σ_{j=1}^n p^m_{ij}(µ) · max_{j=1,...,n} ĝµ(j).

Further, Σ_{j=1}^n p^m_{ij}(µ) = P(S_m ≠ 0 | S_0 = i, µ) ≤ Pµ^{⌊m/n⌋}, and max_{j=1,...,n} ĝµ(j) ≤ K
for some constant K (as |g| ≤ M). Thus,

|Jµ(i)| ≤ Σ_{m=0}^∞ Pµ^{⌊m/n⌋} · K < ∞ ∀ i ∈ [n], as Pµ < 1.

Define ḡ(i, u) = Σ_{j=0}^n p_ij(u) · g(i, u, j), the expected single stage cost in non-terminal state
i ∈ [n] when action u is chosen. We now define mappings T and Tµ as functions of J = (J(1), ..., J(n)),
where J is a mapping from the non-terminal (NT) states to R:

(T J)(i) := min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · J(j) ] ∀ i ∈ [n],

(Tµ J)(i) := ḡµ(i) + Σ_{j=1}^n p_ij(µ(i)) · J(j) ∀ i ∈ [n], where ḡµ(i) = ḡ(i, µ(i)).

T and Tµ are operators on the space of mappings from NT states to R. They act on J to give
another mapping. Define the matrix Pµ as the n × n matrix whose (i, j)-th entry is p_ij(µ(i)):

Pµ = [ p_ij(µ(i)) ]_{i,j=1}^n.

Pµ is not a stochastic matrix (in general), as the sum of the elements in each row is ≤ 1, since the
matrix runs only over the non-terminal states.

7 Lecture 7: Stochastic Shortest Path Problems
Episodic or stochastic shortest path problems are characterized by a goal state or terminal state, 0.
The terminal state is reached with probability 1, but the time at which it is reached is not known in
advance. Stochastic shortest path problems are also referred to as episodic problems. The non-terminal
states are referred to as 1, 2, ..., n.

p_00(u) = 1 ∀ u ∈ A(0) and g(0, u, 0) = 0 ∀ u ∈ A(0).

We say that a policy µ is proper if

Pµ = max_{i=1,2,...,n} P(S_n ≠ 0 | S_0 = i, µ) < 1.

It says that from every initial state, the probability of having reached the terminal state after n
steps is positive. Let S = {1, ..., n} be the set of non-terminal states. Let S⁺ = S ∪ {0}.
Let's get back to the analysis of the stochastic shortest path problem.

ḡ(i, u) = Σ_{j=0}^n p_ij(u) · g(i, u, j)

is the expected single stage cost in the state i ∈ [n] when action u is chosen. Define mappings
T, Tµ : R^{|S|} → R^{|S|}, where R^{|S|} = {f | f : S → R}, as follows. Let J = (J(1), ..., J(n)) be a
mapping from non-terminal states to R. Then,

(T J)(i) := min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · J(j) ] ∀ i ∈ S,

(Tµ J)(i) := ḡµ(i) + Σ_{j=1}^n p_ij(µ(i)) · J(j) ∀ i ∈ S, where ḡµ(i) = ḡ(i, µ(i)).

T J(i) is a real number. T J is a vector in R|S| , given by T J = (T J(1), . . . , T J(n)). T is an operator


on the space of mappings from non-terminal states to R. It acts on J to give another mapping.
Tµ J is also a vector in R|S| , given by Tµ J = (Tµ J(1), . . . , Tµ J(n)). Tµ is an operator on the space
of mappings from non-terminal states to R. It acts on J to give another mapping.
We also defined the matrix Pµ as the n × n matrix whose (i, j)-th entry is p_ij(µ(i)):

Pµ = [ p_ij(µ(i)) ]_{i,j=1}^n.

Pµ is not a stochastic matrix (in general), as the sum of the elements in each row is ≤ 1, since the
matrix runs only over the non-terminal states.
Using this notation, we can write

Tµ J = ḡµ + Pµ J,

where ḡµ = (ḡµ(1), ..., ḡµ(n)). Further define T^k J = T(T^{k−1} J) for k ≥ 1, where T^0 := I. Thus

T^k J = (T ∘ T ∘ ... ∘ T) J,
Consider k = 2. Then,

(T² J)(i) = T(T J)(i) = min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · (T J)(j) ]
          = min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · min_{v∈A(j)} ( ḡ(j, v) + Σ_{k=1}^n p_jk(v) · J(k) ) ].

The above expression can be interpreted in the context of finite horizon problems as the optimal
cost of a two-stage problem with single stage costs ḡ(·, ·) and terminal cost J(·). Then, for any k,
(T^k J)(i) is the optimal cost of a k-stage problem with initial state i, single stage costs ḡ(·, ·) and
terminal cost J(·):

(T^k J)(i) = min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · (T^{k−1} J)(j) ] ∀ i ∈ S = {1, 2, ..., n}.

Lemma 1 (Monotonicity Lemma): For any J, J̄ ∈ R^{|S|}, if J(i) ≤ J̄(i) for all i ∈ S, then
(T J)(i) ≤ (T J̄)(i) and (Tµ J)(i) ≤ (Tµ J̄)(i) for all i ∈ S.

Proof. Let J, J̄ ∈ R^{|S|} be such that J(i) ≤ J̄(i) for all i ∈ S. Then, for any i ∈ S, we have:

(T J)(i) = min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · J(j) ]
         ≤ min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · J̄(j) ]
         = (T J̄)(i).

Similarly, (Tµ J)(i) ≤ (Tµ J̄)(i). Now we can induct on k to show that (T^k J)(i) ≤ (T^k J̄)(i)
for all i ∈ S.
Lemma 2: For every k ≥ 0, vector J = (J(1), ..., J(n)), stationary policy µ, and scalar r > 0, with
e = (1, 1, ..., 1) the vector of all ones in R^n:
1. (T^k(J + re))(i) ≤ (T^k J)(i) + r for all i ∈ S.
2. (Tµ^k(J + re))(i) ≤ (Tµ^k J)(i) + r for all i ∈ S.
If r < 0, then the inequalities are reversed.

Proof. 1. Consider k = 1.

(T(J + re))(i) = min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · (J + re)(j) ]
              = min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · J(j) + r Σ_{j=1}^n p_ij(u) ]
              ≤ min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) · J(j) + r ] = (T J)(i) + r,

since Σ_{j=1}^n p_ij(u) ≤ 1. Induct on k to show that (T^k(J + re))(i) ≤ (T^k J)(i) + r for all i ∈ S.

2. Follows similarly.

We will make these two (fairly reasonable) assumptions in the further analysis:

(A) There exists at least one proper policy.
(B) For every improper policy µ, Jµ(i) = ∞ for at least one state i ∈ S.

Proposition 1: We have the following:
(a) For a proper policy µ, the associated cost vector Jµ satisfies

lim_{k→∞} (Tµ^k J)(i) = Jµ(i) ∀ i ∈ S, ∀ J ∈ R^n.

Moreover, Jµ = Tµ Jµ and Jµ is the unique fixed point of Tµ.

(b) A stationary policy µ for which there exists some vector J satisfying J(i) ≥ (Tµ J)(i) for all i ∈ S is proper.

Part (a) says that the sequence J, Tµ J, Tµ² J, ... converges to Jµ, and that Jµ is the unique fixed
point of Tµ. The equation Jµ = Tµ Jµ means

Jµ(i) = ḡ(i, µ(i)) + Σ_{j∈S} p_ij(µ(i)) Jµ(j) ∀ i ∈ S.

This is the Bellman equation for the policy µ. This also gives a way to compute Jµ, by
iterating Tµ starting from an arbitrary vector J. This numerical method is called value iteration.
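Both routes to Jµ can be compared on a toy problem: iterating Tµ from an arbitrary starting vector, and solving the linear Bellman system Jµ = ḡµ + Pµ Jµ directly. The substochastic matrix and costs below are hypothetical; the probability mass leaked from each row goes to the terminal state, so the policy is proper.

```python
import numpy as np

# Value iteration for a fixed (hypothetical) proper policy versus the
# exact linear solve of the Bellman equation J = g_bar + P_mu J.
P_mu = np.array([[0.2, 0.3, 0.3],
                 [0.1, 0.4, 0.2],
                 [0.3, 0.1, 0.1]])       # row sums < 1  =>  proper policy
g_bar = np.array([1.0, 2.0, 0.5])        # expected single-stage costs

J = np.array([10.0, -5.0, 3.0])          # arbitrary starting vector
for _ in range(500):
    J = g_bar + P_mu @ J                 # J <- T_mu J

J_exact = np.linalg.solve(np.eye(3) - P_mu, g_bar)
```

The iterates converge to the same vector as the linear solve, illustrating that Jµ is the unique fixed point of Tµ.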

Proof. (a) Recall that

Tµ J = ḡµ + Pµ J.

Then,

Tµ² J = Tµ(Tµ J) = ḡµ + Pµ(Tµ J) = ḡµ + Pµ(ḡµ + Pµ J) = ḡµ + Pµ ḡµ + Pµ² J.

Induct on k to show that

Tµ^k J = Pµ^k J + Σ_{m=0}^{k−1} Pµ^m ḡµ.

We have seen that

P(S_k ≠ 0 | S_0 = i, µ) ≤ Pµ^{⌊k/n⌋} ∀ i ∈ S.

Hence,

|(Pµ^k J)(i)| = | Σ_{j=1}^n P(S_k = j | S_0 = i, µ) · J(j) |
             ≤ Σ_{j=1}^n P(S_k = j | S_0 = i, µ) · max_{j=1,...,n} |J(j)|
             = P(S_k ≠ 0 | S_0 = i, µ) · max_{j=1,...,n} |J(j)|
             ≤ Pµ^{⌊k/n⌋} · max_{j=1,...,n} |J(j)| → 0 as k → ∞.

Hence,

lim_{k→∞} Tµ^k J = lim_{k→∞} [ Pµ^k J + Σ_{m=0}^{k−1} Pµ^m ḡµ ] = Σ_{m=0}^∞ Pµ^m ḡµ = Jµ.

By definition, Tµ^{k+1} J = ḡµ + Pµ Tµ^k J. Take the limit as k → ∞ to get Jµ = ḡµ + Pµ Jµ = Tµ Jµ.
Hence, Jµ is a fixed point of Tµ. Suppose ∃ J̄µ such that J̄µ = Tµ J̄µ. If we repeatedly apply
Tµ to J̄µ, we get Jµ:

J̄µ = Tµ J̄µ = Tµ² J̄µ = ... → Jµ as k → ∞.

Then, J̄µ = lim_{k→∞} Tµ^k J̄µ = Jµ. Hence, Jµ is the unique fixed point of Tµ.

8 Lecture 8: Stochastic Shortest Path (Shalabh Bhatnagar)
Recall the proposition:
(a) For a proper policy µ, the associated cost vector Jµ satisfies

lim_{k→∞} (Tµ^k J)(i) = Jµ(i) ∀ i ∈ S, ∀ J ∈ R^n.

Moreover, Jµ = Tµ Jµ and Jµ is the unique fixed point of Tµ.

(b) A stationary policy µ for which there exists some vector J satisfying J(i) ≥ (Tµ J)(i) for all i ∈ S is proper.
We have proved the first part of the proposition. We will now prove the second part.

Proof. For a stationary policy µ, suppose ∃ J ∈ R^n such that J(i) ≥ (Tµ J)(i) for all i ∈ S. By
monotonicity of Tµ, we have (Tµ J)(i) ≥ (Tµ² J)(i). Applying this recursively, we get

J(i) ≥ (Tµ J)(i) ≥ (Tµ² J)(i) ≥ ... ≥ (Tµ^k J)(i) = (Pµ^k J)(i) + ( Σ_{m=0}^{k−1} Pµ^m ḡµ )(i).

If µ were not proper, then by assumption (B), Jµ(i) = ∞ for some i ∈ S. Since
lim_{k→∞} Σ_{m=0}^{k−1} Pµ^m ḡµ = Jµ while (Pµ^k J)(i) is bounded and J(i) is finite, this
contradicts the above inequality. Hence, µ must be proper.

We now present a generalization of the above proposition.

Proposition 2:
(a) The optimal cost vector J* satisfies J* = T J* (known as the Bellman equation). Moreover,
J* is the unique solution of the Bellman equation.
(b) lim_{k→∞} (T^k J)(i) = J*(i) for all i ∈ S and all J ∈ R^{|S|}.
(c) A stationary policy µ is optimal if and only if Tµ J* = J* (equivalently, Tµ J* = T J*).

Proof. ((a), (b)) We will first show that T has at most one fixed point. Suppose J and J′ are two
fixed points of T. Let µ and µ′ be such that

J = T J = Tµ J and J′ = T J′ = Tµ′ J′.

Such µ and µ′ always exist. Indeed,

(T J)(i) = min_{u∈A(i)} Σ_{j∈S} p_ij(u) (g(i, u, j) + J(j)) ∀ i ∈ S,

and if for each i we take µ(i) to be a minimizing action, then

(T J)(i) = Σ_{j∈S} p_ij(µ(i)) (g(i, µ(i), j) + J(j)) = (Tµ J)(i).

Thus,

J = Tµ J, J′ = Tµ′ J′.

From Proposition 1(b), µ and µ′ are proper. By Proposition 1(a), J = Jµ and J′ = Jµ′. Now,

J = T J = T² J = ... = T^k J,

for any k ≥ 1. Further, T^k J ≤ Tµ′^k J, as Tµ′ evaluates the fixed policy µ′ while T^k minimizes.
It then follows that J ≤ lim_{k→∞} Tµ′^k J = Jµ′ = J′. Similarly, J′ ≤ J. Hence, J = J′ and T has
at most one fixed point.
We will now show that T has at least one fixed point. Let µ be a proper policy (there exists a
proper policy by assumption (A)). Let µ′ be another policy such that Tµ′ Jµ = T Jµ. Then,

Jµ = Tµ Jµ ≥ T Jµ = Tµ′ Jµ
⟹ Jµ ≥ Tµ′ Jµ ⟹ µ′ is proper by Proposition 1(b).

Furthermore,

Jµ ≥ Tµ′ Jµ ≥ Tµ′² Jµ ≥ ... ≥ lim_{k→∞} Tµ′^k Jµ = Jµ′ ⟹ Jµ ≥ Jµ′.

Continuing in this manner, we obtain a sequence of policies {µ_k} such that each µ_k is proper and

Jµ_k = Tµ_k Jµ_k ≥ T Jµ_k = Tµ_{k+1} Jµ_k ≥ Tµ_{k+1}² Jµ_k ≥ ... ≥ lim_{m→∞} Tµ_{k+1}^m Jµ_k = Jµ_{k+1}.

Hence,

Jµ_k ≥ Tµ_{k+1} Jµ_k ≥ Jµ_{k+1} ∀ k, where Tµ_{k+1} Jµ_k = T Jµ_k.

However, since there are only finitely many policies, one cannot keep improving the cost Jµ_k
indefinitely. Hence, there exists a policy µ such that Jµ ≥ T Jµ ≥ Jµ, i.e., Jµ = T Jµ. By the
uniqueness argument above, Jµ is the unique fixed point of T.
Next we will show that Jµ = J* and T^k J → J* as k → ∞. Let e = (1, 1, ..., 1), let δ > 0 be a
scalar, and let Ĵ be an n-dimensional vector satisfying Tµ Ĵ = Ĵ − δe:

Tµ Ĵ = Ĵ − δe
⟹ Ĵ = Tµ Ĵ + δe = (gµ + δe) + Pµ Ĵ.

Thus Ĵ is the cost vector corresponding to the policy µ with gµ replaced by gµ + δe; hence such a
Ĵ exists. Moreover, Jµ ≤ Ĵ. This implies that

Jµ = T Jµ ≤ T Ĵ ≤ Tµ Ĵ = Ĵ − δe ≤ Ĵ.

If we keep on applying the operator T, we get:

Jµ = T^k Jµ ≤ T^k Ĵ ≤ T^{k−1} Ĵ ≤ Ĵ.

Thus {T^k Ĵ} is a bounded, monotonically decreasing sequence, so T^k Ĵ → J̃ as k → ∞ for some J̃
such that

T J̃ = T ( lim_{k→∞} T^k Ĵ ) = lim_{k→∞} T^{k+1} Ĵ = J̃ ⟹ J̃ = Jµ,

as Jµ is the unique fixed point of T. Further, from the monotonicity lemma (Lemma 2 with r = −δ),

Jµ − δe = T Jµ − δe ≤ T(Jµ − δe) ≤ T Jµ = Jµ.

Further,

T(Jµ − δe) ≤ T²(Jµ − δe) ≤ ... ≤ Jµ.

Hence, {T^k(Jµ − δe)} is a monotonically increasing sequence bounded above, and
lim_{k→∞} T^k(Jµ − δe) = Jµ, again because Jµ is the unique fixed point of T. Now take any J such
that Jµ − δe ≤ J ≤ Ĵ. (Recall that Ĵ is the cost vector for policy µ with single stage costs gµ + δe;
by taking δ large enough, any given J can be sandwiched in this way.) Again from monotonicity of T,

T^k(Jµ − δe) ≤ T^k J ≤ T^k Ĵ ∀ k ≥ 1.

Also,

Jµ = lim_{k→∞} T^k(Jµ − δe) ≤ lim_{k→∞} T^k J ≤ lim_{k→∞} T^k Ĵ = Jµ,

so T^k J → Jµ. To show that Jµ = J*, take any policy π = {µ0, µ1, ...}. Then,

Tµ0 Tµ1 ... Tµ_{k−1} J0 ≥ T^k J0, where J0 is any arbitrary vector,
⟹ limsup_{k→∞} Tµ0 Tµ1 ... Tµ_{k−1} J0 ≥ limsup_{k→∞} T^k J0 ⟹ Jπ ≥ Jµ.

Since the policy π was arbitrary, µ must be optimal. Hence, Jµ = J*.

Proof. ((c)) If µ is optimal, then Jµ = J*. By assumptions (A) and (B), µ is proper. By Proposition
1(a),

Tµ J* = Tµ Jµ = Jµ = J* = T J*.

Conversely, let Tµ J* = T J*. Since J* = T J*, we get J* = Tµ J*; hence µ is proper by Proposition
1(b), and since there exists a unique solution Jµ to the equation J = Tµ J, we conclude J* = Jµ,
i.e., µ is optimal.

9 Lecture 9: Stochastic Shortest Path (Shalabh Bhatnagar)
Recall that we were looking at the operator T : R^{|S|} → R^{|S|} defined as

(T J)(i) = min_{u∈A(i)} Σ_{j∈S} p_ij(u) (g(i, u, j) + J(j)) ∀ i ∈ S.

Another operator Tµ : R^{|S|} → R^{|S|} is defined as

(Tµ J)(i) = Σ_{j∈S} p_ij(µ(i)) (g(i, µ(i), j) + J(j)) ∀ i ∈ S.

In today's lecture, we will show that T and Tµ are contraction maps in a certain norm ||·||_ψ, i.e.,
∃ β ∈ (0, 1) such that for all J, J̄ ∈ R^{|S|},

||T J − T J̄||_ψ ≤ β ||J − J̄||_ψ,

and

||Tµ J − Tµ J̄||_ψ ≤ β ||J − J̄||_ψ.

Recall that S = {1, 2, ..., n} is the set of non-terminal states and 0 is the terminal state. Let
S⁺ = S ∪ {0} be the set of all states.
Banach's Fixed Point Theorem (the contraction mapping theorem): let S be a complete separable metric
space with metric ρ, and suppose T is a contraction with respect to ρ. Then T has a unique fixed
point x*, i.e., T x* = x*.¹
We will show that there is a vector φ = (φ(1), ..., φ(n)) such that φ(i) > 0 for all i, and a scalar
β ∈ (0, 1), such that for all J, J̄ ∈ R^{|S|},

||T J − T J̄||_ψ ≤ β ||J − J̄||_ψ,

where

||J||_ψ = max_{i∈S} |J(i)| / φ(i).

Proposition: Assume all stationary policies are proper. Then ∃ a vector φ = (φ(1), ..., φ(n)) with
φ(i) > 0 for all i such that the mappings T and Tµ, for all stationary policies µ, are contractions
with respect to the norm ||·||_ψ. In particular, ∃ β ∈ (0, 1) such that

Σ_{j=1}^n p_ij(u) φ(j) ≤ β φ(i) ∀ i ∈ S, u ∈ A(i).

Proof. Consider a new stochastic shortest path problem in which the transition probabilities are the
same as before, but all transition costs are equal to −1, except at the termination state, where

g(0, u, 0) = 0 ∀ u ∈ A(0).
¹A complete metric space is a metric space in which every Cauchy sequence converges to a point in the space. A
separable metric space is a metric space that has a countable dense subset.

Denote by Ĵ(i) the optimal cost to go from state i in the new problem. Then,

Ĵ(i) = −1 + min_{u∈A(i)} Σ_{j∈S} p_ij(u) Ĵ(j)
     ≤ −1 + Σ_{j∈S} p_ij(u) Ĵ(j) for any given u ∈ A(i).

Let φ(i) = −Ĵ(i). Then φ(i) ≥ 1 for all i (at least one stage, of cost −1, is incurred before
termination). Multiplying the inequality above by −1,

−Ĵ(i) ≥ 1 + Σ_{j∈S} p_ij(u) (−Ĵ(j)),
φ(i) ≥ 1 + Σ_{j∈S} p_ij(u) φ(j),

and hence

Σ_{j∈S} p_ij(u) φ(j) ≤ φ(i) − 1 ≤ β φ(i), where β := max_{i∈S} (φ(i) − 1)/φ(i) < 1.

Now, for a stationary policy µ, a state i, and vectors J, J̄ ∈ R^{|S|}, we have

|(Tµ J)(i) − (Tµ J̄)(i)| = | Σ_{j∈S} p_ij(µ(i)) (J(j) − J̄(j)) |
                        ≤ ( Σ_{j=1}^n p_ij(µ(i)) φ(j) ) · max_{j∈S} |J(j) − J̄(j)| / φ(j)
                        ≤ β φ(i) · ||J − J̄||_ψ

⟹ |(Tµ J)(i) − (Tµ J̄)(i)| / φ(i) ≤ β ||J − J̄||_ψ ∀ i ∈ S.

Since the inequality holds for all i ∈ S, we have

max_{i∈S} |(Tµ J)(i) − (Tµ J̄)(i)| / φ(i) ≤ β ||J − J̄||_ψ.

The above inequality then reduces to

||Tµ J − Tµ J̄||_ψ ≤ β ||J − J̄||_ψ.

Therefore, Tµ is a contraction with respect to the norm ||·||_ψ. The above inequality also gives us
the following:

(Tµ J)(i) ≤ (Tµ J̄)(i) + β φ(i) ||J − J̄||_ψ.

Take the minimum over u ∈ A(i) on both sides to get

(T J)(i) ≤ (T J̄)(i) + β φ(i) ||J − J̄||_ψ.

Using the analogous inequality with J and J̄ interchanged, we get

(T J)(i) ≥ (T J̄)(i) − β φ(i) ||J − J̄||_ψ.

Combining the two inequalities, we get

|(T J)(i) − (T J̄)(i)| / φ(i) ≤ β ||J − J̄||_ψ ∀ i ∈ S.

This implies that T is a contraction with respect to the norm ||·||_ψ.
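The construction in the proof can be carried out numerically: solve the auxiliary all-costs-(−1) problem for Ĵ, set φ = −Ĵ, and check the key inequality Σ_j p_ij(u) φ(j) ≤ β φ(i). The two-action SSP below is hypothetical; every row of every transition matrix is substochastic, so all stationary policies are proper, as the proposition assumes.

```python
import numpy as np

# Building the weights phi from the auxiliary problem with all costs -1,
# then verifying the contraction inequality (hypothetical problem data).
P = np.array([[[0.3, 0.2, 0.1],
               [0.1, 0.1, 0.4],
               [0.2, 0.3, 0.2]],
              [[0.1, 0.4, 0.2],
               [0.3, 0.2, 0.1],
               [0.1, 0.1, 0.5]]])        # P[u][i][j], all rows substochastic

J_hat = np.zeros(3)
for _ in range(1000):                    # value iteration, costs all -1
    J_hat = -1.0 + (P @ J_hat).min(axis=0)

phi = -J_hat                             # weights, phi(i) >= 1
beta = float(((phi - 1.0) / phi).max())
weighted = P @ phi                       # weighted[u][i] = sum_j p_ij(u) phi(j)
```

At the fixed point, φ(i) = 1 + max_u Σ_j p_ij(u) φ(j), so Σ_j p_ij(u) φ(j) ≤ φ(i) − 1 ≤ β φ(i) for every action, which is exactly what the assertions below check.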

We now turn our attention to numerical schemes for solving the MDPs.

9.1 Numerical Schemes for MDPs
9.1.1 Value Iteration
Recall Proposition 1: for all J ∈ R^{|S|},

lim_{k→∞} Tµ^k J = Jµ,
lim_{k→∞} T^k J = J*.

Consider the optimal control problem.

1. Choose some J ∈ R^{|S|}.
2. Recursively iterate J ← T J, k = 1, 2, .... This is called value iteration.

We know that T^k J → J* as k → ∞. Let V0, V1, V2, ... be the sequence of functions obtained by
value iteration when T is applied: start with some vector V0 ∈ R^{|S|} and set

V_{m+1}(i) = min_{u∈A(i)} Σ_{j∈S} p_ij(u) (g(i, u, j) + V_m(j)) ∀ i ∈ S, ∀ m ≥ 0.

By Proposition 1, V_n → V* as n → ∞, where V* is the optimal cost-to-go function satisfying V* = T V*.
Look at Sutton and Barto, Chapter 4, the Grid World example. In the Grid World example, the
state space is S = {1, 2, ..., 16} and the action space is A(i) = {N, E, S, W} for all i ∈ S. The
two corners on the main diagonal are terminal states, while the remaining 14 are non-terminal states.
The feasible actions are such that the agent cannot move out of the grid; if it takes an infeasible
direction, the agent stays in the same state (cell). The rewards are all −1 until termination. After
termination, the reward is 0. Therefore, the agent should reach the goal state as soon as it can.
Consider the equiprobable random policy, in which from any of the 14 non-terminal states the agent
moves in each of the four directions with equal probability 1/4. Given this policy, the aim is to apply
the value iteration updates for the equiprobable random policy (we are evaluating this policy, not
looking for the optimal one).
• Initialize V0(i) = 0 for all i ∈ S. This is the initial cost-to-go function.
• For the first iteration (k = 1), for a state adjacent to a terminal state,

V(1) ← (1/4)(−1 − 1 − 1 − 1) + (1/4)(0 + 0 + 0 + 0) = −1.

Likewise, check that V(i) = −1 for all non-terminal i ∈ S.
• For the second iteration (k = 2),

V(1) ← (1/4)(−1 − 1 − 1 − 1) + (1/4)(−1 − 1 − 1 + 0) = −7/4.

Likewise, check that V(i) = −7/4 for all i ∈ S adjacent to a terminal state, and V(i) = −2 for
the rest.
• For the third iteration (k = 3),

V(1) ← (1/4)(−1 − 1 − 1 − 1) + (1/4)(−7/4 − 2 − 2 + 0) = −39/16.

Likewise, check that V(i) = −39/16 for the states adjacent to the terminal states; the rest
become −47/16, except that the states all of whose neighbours had value −2 become −3, and so on.
• These values converge to the cost-to-go function of the equiprobable random policy, with values
such as −14, −20 and −22; the convergence is really slow.
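The sweeps above are easy to reproduce. The sketch below implements synchronous iterative evaluation of the equiprobable policy on the 4×4 grid, using an assumed but standard encoding: states 0–15 in row-major order, with states 0 and 15 terminal.

```python
import numpy as np

# Iterative evaluation of the equiprobable random policy on the 4x4
# Grid World (Sutton & Barto, Ch. 4); encoding assumed as described above.
def step(s, a):
    """Next state for action a (0=N, 1=E, 2=S, 3=W) in state s."""
    r, c = divmod(s, 4)
    dr, dc = [(-1, 0), (0, 1), (1, 0), (0, -1)][a]
    nr, nc = r + dr, c + dc
    if 0 <= nr < 4 and 0 <= nc < 4:
        return nr * 4 + nc
    return s                      # bumping into the wall: stay put

terminal = {0, 15}
V = np.zeros(16)
history = []
for sweep in range(2000):
    V_new = np.zeros(16)
    for s in range(16):
        if s in terminal:
            continue              # terminal values stay 0
        V_new[s] = sum(0.25 * (-1.0 + V[step(s, a)]) for a in range(4))
    V = V_new
    history.append(V.copy())
```

After one sweep every non-terminal value is −1; after two sweeps the values are −7/4 next to the terminals and −2 elsewhere; the limit is the value function with entries −14, −18, −20, −22.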

10 Lecture 10: Policy Iteration (Shalabh Bhatnagar)
We will now discuss the Gauss-Seidel value iteration.

10.1 Gauss-Seidel Value Iteration


Define an operator F : R^{|S|} → R^{|S|} as follows. For i = 1,

(F J)(1) = (T J)(1) = min_{u∈A(1)} [ ḡ(1, u) + Σ_{j=1}^n p_1j(u) J(j) ],

and for i = 2, 3, ..., n,

(F J)(i) = min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^{i−1} p_ij(u) (F J)(j) + Σ_{j=i}^n p_ij(u) J(j) ],

i.e., within a single sweep, the already-updated values (F J)(j), j < i, are used in place of J(j).
As a remark,
• (F J)(1) = (T J)(1).
• lim_{k→∞} F^k J = J* for all J ∈ R^{|S|}.
There is another procedure called policy iteration. In value iteration, we start with some J ∈ R^{|S|}
and repeatedly apply the operator T. In policy iteration, we start with some policy µ and update
the policy at each iteration. The procedure is as follows:
1. Start with an initial (proper) policy µ0.
2. Policy evaluation: given a policy µ_k, compute J^{µ_k}(i), i ∈ S, as the solution of the linear
system

J(i) = ḡ(i, µ_k(i)) + Σ_{j=1}^n p_ij(µ_k(i)) J(j) ∀ i ∈ S,

in the unknowns J(1), J(2), ..., J(n).

3. Policy improvement: find a new policy µ_{k+1} such that

µ_{k+1}(i) = arg min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) J^{µ_k}(j) ] ∀ i ∈ S,

where J^{µ_k} is obtained from the policy evaluation step. Alternatively, one can solve

Tµ_{k+1} J^{µ_k} = T J^{µ_k},

which written out is

ḡ(i, µ_{k+1}(i)) + Σ_{j=1}^n p_ij(µ_{k+1}(i)) J^{µ_k}(j) = min_{u∈A(i)} [ ḡ(i, u) + Σ_{j=1}^n p_ij(u) J^{µ_k}(j) ] ∀ i ∈ S.

4. Keep iterating the policy evaluation and policy improvement steps. Since the number of
policies is finite, the policy iteration algorithm converges to the optimal policy in a finite
number of steps. Or one can stop the iteration when a tolerance criterion is met.
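The evaluation/improvement loop can be sketched as follows on a hypothetical 3-state SSP with two actions (all transition data made up for illustration; the rows are substochastic, with the missing mass going to the cost-free terminal state, so every stationary policy is proper). Policy evaluation is done by an exact linear solve, and the result is cross-checked against plain value iteration.

```python
import numpy as np

# Policy iteration on a small hypothetical SSP.
# P[u][i][j]: transitions among non-terminal states; g_bar[u][i]: expected cost.
P = np.array([[[0.5, 0.3, 0.1],
               [0.2, 0.5, 0.2],
               [0.1, 0.2, 0.6]],
              [[0.1, 0.2, 0.2],
               [0.3, 0.1, 0.1],
               [0.2, 0.2, 0.1]]])
g_bar = np.array([[1.0, 2.0, 3.0],
                  [4.0, 1.5, 2.5]])
n = 3

mu = np.zeros(n, dtype=int)                       # initial policy
for _ in range(50):
    P_mu = P[mu, np.arange(n)]                    # transition rows under mu
    g_mu = g_bar[mu, np.arange(n)]
    J = np.linalg.solve(np.eye(n) - P_mu, g_mu)   # policy evaluation
    Q = g_bar + P @ J                             # Q[u][i]: one-step lookahead
    mu_new = Q.argmin(axis=0)                     # policy improvement
    if np.array_equal(mu_new, mu):
        break                                     # policy stable: optimal
    mu = mu_new

# Cross-check: plain value iteration converges to the same J*.
J_vi = np.zeros(n)
for _ in range(2000):
    J_vi = (g_bar + P @ J_vi).min(axis=0)
```

On termination, J satisfies the Bellman equation J = T J and agrees with the value-iteration limit.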

The above structure is like a nested loop: the outer loop is the policy improvement and the inner
loop is the policy evaluation. The evaluation can also be done iteratively: starting from an initial
given J0(·), update

J_{ℓ+1}(i) = ḡ(i, µ_k(i)) + Σ_{j=1}^n p_ij(µ_k(i)) J_ℓ(j) ∀ i ∈ S, J_ℓ → J^{µ_k}.

Repeat the process if J^{µ_{k+1}}(i) < J^{µ_k}(i) for some i ∈ S. If J^{µ_{k+1}}(i) = J^{µ_k}(i) for
all i ∈ S, then stop the iteration and output the policy µ_{k+1} as the optimal policy.
Proposition: The policy iteration algorithm generates an improving sequence of proper policies, i.e.,

J^{µ_{k+1}}(i) ≤ J^{µ_k}(i) ∀ i ∈ S, k ∈ N,

and it terminates with an optimal policy µ* in a finite number of steps.

Proof. Given a proper policy µ, the new policy µ̄ is obtained via policy improvement as

Tµ̄ J^µ = T J^µ.

Then,

J^µ = Tµ J^µ ≥ T J^µ = Tµ̄ J^µ.

In particular, J^µ ≥ Tµ̄ J^µ. By monotonicity of Tµ̄, we have

J^µ ≥ Tµ̄ J^µ ≥ Tµ̄² J^µ ≥ ... ≥ lim_{k→∞} Tµ̄^k J^µ = J^µ̄ ⟹ J^µ ≥ J^µ̄.

We know that µ is proper. How do we show that µ̄ is proper? Since J^µ ≥ Tµ̄ J^µ, Proposition 1(b)
applies directly. Alternatively, suppose for the sake of contradiction that µ̄ is improper. Then
∃ i ∈ S such that J^µ̄(i) = ∞ (assumption (B)). For that same i, J^µ(i) = ∞ (as J^µ ≥ J^µ̄), a
contradiction since µ is proper. Hence, µ̄ is proper.
Suppose µ is not optimal. We claim that then J^µ̄(i) < J^µ(i) for some i ∈ S. Otherwise J^µ = J^µ̄,
and in the latter case

J^µ = J^µ̄ = Tµ̄ J^µ̄ = Tµ̄ J^µ = T J^µ ⟹ J^µ = T J^µ ⟹ J^µ = J*,

so µ would be optimal. Hence, the new policy is strictly better than the current policy whenever the
current policy is not optimal. Since the number of proper policies is finite, the policy iteration
algorithm will terminate in a finite number of steps, giving an optimal proper policy.

10.2 Modified Policy Iteration


Select a sequence of positive integers m0, m1, m2, ..., and suppose J0, J1, J2, ... and stationary
policies µ0, µ1, µ2, ... are generated as:

Tµ_k J_k = T J_k ∀ k ∈ N, and J_{k+1} = Tµ_k^{m_k} J_k ∀ k ∈ N.

One can show that this procedure terminates with an optimal policy µ* and the optimal value function
J*. Consider:
• m_k = 1 for all k ∈ N: this corresponds to value iteration.
• m_k = ∞ for all k ∈ N: this corresponds to policy iteration.

10.3 Multi-Stage Lookahead Policy Iteration
Regular policy iteration uses a one-step lookahead: it finds the optimal decision for a one-stage
problem with single stage cost g(i, u, j) and terminal cost J^µ(j), where µ is the current policy.
In the m-stage lookahead version, we find an optimal policy for an m-stage dynamic programming
problem: starting in state i ∈ S, we make m subsequent decisions, incur the corresponding costs of
the m stages, and receive a terminal cost J^µ(j), where j is the state reached after m stages.
Claim: m-stage lookahead policy iteration terminates with the optimal policy under the same
conditions as regular policy iteration.

Proof. Let {µ̄0, µ̄1, ..., µ̄_{m−1}} be an optimal policy for the m-stage dynamic programming problem
with terminal cost J^µ. Thus,

Tµ̄_k T^{m−k−1} J^µ = T^{m−k} J^µ ∀ k = 0, 1, ..., m−1.

• For k = m−1: Tµ̄_{m−1} J^µ = T J^µ.
• For k = m−2: Tµ̄_{m−2} T J^µ = T² J^µ.
• For k = m−3: Tµ̄_{m−3} T² J^µ = T³ J^µ. And so on.
• For k = 0: Tµ̄_0 T^{m−1} J^µ = T^m J^µ.

Observe that T J^µ ≤ Tµ J^µ = J^µ. Repeatedly apply the operator T to get

T^{k+1} J^µ ≤ T^k J^µ ≤ J^µ ∀ k ∈ N.

Hence,

Tµ̄_k T^{m−k−1} J^µ = T^{m−k} J^µ ≤ T^{m−k−1} J^µ ∀ k = 0, 1, ..., m−1.

Thus, ∀ ℓ ≥ 1, we have

Tµ̄_k^ℓ T^{m−k−1} J^µ ≤ Tµ̄_k T^{m−k−1} J^µ = T^{m−k} J^µ.

Taking limits as ℓ → ∞, we obtain

J^{µ̄_k} = lim_{ℓ→∞} Tµ̄_k^ℓ T^{m−k−1} J^µ ≤ T^{m−k} J^µ ≤ J^µ ∀ k = 0, 1, ..., m−1. (*)

Thus, for the successor policy µ̄ generated by the m-stage lookahead policy iteration, i.e., µ̄ = µ̄0,
we have

J^µ̄ ≤ T^m J^µ ≤ J^µ (set k = 0 in (*)).

This implies that µ̄ is an improved policy relative to µ. If J^µ̄ = J^µ, then J^µ = T J^µ and
J^µ = J*. Hence this algorithm also terminates with the optimal policy.

11 Lecture 11: Infinite horizon discounted horizon(Shalabh Bhat-
nagar)
Scribe: Sahil
In this setting there is no terminating state. Let the states be denoted by {1, 2, · · · , n}.
A(i) = set of feasible actions in state i.
A = ∪i∈S A(i) = set of all actions. Further, we assume |S| < ∞, |A| < ∞.
Let us define,

J ∗ (i) = min_µ E[ Σ_{k=0}^{∞} α^k g(ik , µ(ik ), ik+1 ) | i0 = i ],

where 0 < α < 1 is called the discount factor.

J ∗ (i) is the value of state i or cost to go from state i.


Let J = (J(1), J(2), · · · , J(n)). Define the operators T and Tµ as,

(T J)(i) = min_{µ∈A(i)} Σ_{j=1}^{n} Pij (µ) (g(i, µ, j) + αJ(j)) , i ∈ S
(Tµ J)(i) = Σ_{j=1}^{n} Pij (µ(i)) (g(i, µ(i), j) + αJ(j)) , i ∈ S

Let Pµ be a matrix whose entries are Pi,j (µ(i)). i.e.


 
P1,1 (µ(1)) P1,2 (µ(1)) ··· P1,n (µ(1))
Pµ = 
 .. .. .. 
. . ··· . 
Pn,1 (µ(n)) Pn,2 (µ(n)) · · · Pn,n (µ(n))
Note that Pµ is a stochastic matrix because Σ_{j∈S} Pij (µ(i)) = 1 for all i ∈ S.
Let

gµ = ( Σ_{j=1}^{n} P1j (µ(1)) g(1, µ(1), j), Σ_{j=1}^{n} P2j (µ(2)) g(2, µ(2), j), . . . , Σ_{j=1}^{n} Pnj (µ(n)) g(n, µ(n), j) )^T .

Bellman equation under a given policy µ is given by,

Tµ J = gµ + αPµ J = J (1)
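As a concrete sketch, the operators T and Tµ can be coded directly from these definitions, here in the expected one-stage cost form ĝ(i, u) = Σj Pij (u) g(i, u, j). The 2-state, 2-action MDP below is invented for illustration:

```python
import numpy as np

# Bellman operators T and T_mu for a made-up 2-state, 2-action discounted MDP
# (alpha = 0.9; transition matrices P and expected costs ghat are invented).
alpha = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
ghat = {0: np.array([1.0, 2.0]), 1: np.array([0.5, 3.0])}

def T_mu(J, mu):
    # (T_mu J)(i) = ghat(i, mu(i)) + alpha * sum_j P_ij(mu(i)) J(j)
    return np.array([ghat[mu[i]][i] + alpha * P[mu[i]][i] @ J for i in range(2)])

def T(J):
    # (T J)(i) = min_u [ ghat(i, u) + alpha * sum_j P_ij(u) J(j) ]
    return np.minimum(T_mu(J, [0, 0]), T_mu(J, [1, 1]))
```

The monotonicity lemma below and the shift rule T(J + re) = T J + αre can then be checked numerically on sample vectors.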

Lemma 11.1. (Monotonicity lemma):


For any vectors J, Ĵ ∈ Rn such that J(i) ≤ Ĵ(i) for all i ∈ S, and for any stationary policy µ,

(T^k J)(i) ≤ (T^k Ĵ)(i) , ∀i ∈ S, k = 1, 2, · · · (2)
(Tµ^k J)(i) ≤ (Tµ^k Ĵ)(i) , ∀i ∈ S, k = 1, 2, · · · (3)

Let e = (1, 1, · · · , 1). Then, for any vector J = (J(1), J(2), · · · , J(n)) and r ∈ R,
(T (J + re))(i) = min_{µ∈A(i)} Σ_{j=1}^{n} Pij (µ) (g(i, µ, j) + α(J + re)(j))
= min_{µ∈A(i)} Σ_{j=1}^{n} Pij (µ) (g(i, µ, j) + αJ(j)) + αr
= (T J)(i) + αr,

i.e., T (J + re) = T J + αre.
Lemma 11.2. For every k, vector J, stationary µ and scalar r,
(T k (J + re))(i) = (T k J)(i) + αk r , i = 1, · · · , n, k ≥ 1 (4)
(Tµk (J + re))(i) = (Tµk J)(i) + αk r , i = 1, · · · , n, k ≥ 1 (5)
The proof follows by induction on k using the display above; completing it is left as an exercise.
We can convert a discounted cost problem (DCP) to a stochastic shortest path problem (SSPP) by adding an absorbing, cost-free termination state 0: from any state i, the new chain moves to state j with probability αPij (µ(i)) and to the termination state with probability 1 − α.
Probability of termination in the first stage = 1 − α
Probability of termination in the second stage = α(1 − α)
..
.
Probability of termination in the k th stage = αk−1 (1 − α)
Probability of non-termination up to the k th stage is given by,

Pk = 1 − (1 − α)(1 + α + · · · + α^{k−1}) = 1 − (1 − α) (1 − α^k )/(1 − α) = α^k .
Expected single stage cost in the k th stage = α^k Σ_{j=1}^{n} Pij (µ) g(i, µ, j).

Note: All policies are proper for the associated SSPP since from every state under every policy,
there is a probability of 1 − α of termination.
Note also that for the DCP, the expected single stage cost at the k th stage is α^k Σ_{j=1}^{n} Pij (µ) g(i, µ, j).

Under policy µ,

SSPP: Jµ (i) = E[ Σ_{k=0}^{∞} g(ik , µ(ik ), ik+1 ) | i0 = i ] (6)
DCP: Jµ (i) = E[ Σ_{k=0}^{∞} α^k g(ik , µ(ik ), ik+1 ) | i0 = i ] (7)

g(i, µ, 0) = 0 is the equivalence.


Consider now a DCP whose |g(i, µ, j)| ≤ M for all i, j ∈ S, µ ∈ A(i).

Proposition 11.3. For any bounded J : S → R, the optimal cost function satisfies J ∗ (i) =
limN →∞ (T N J)(i) for all i ∈ S

Proof. Consider a policy π = {µ0 , µ1 , · · · } with µk : S → A such that µk (i) ∈ A(i) for all i ∈ S, k ≥ 0. Then,

Jπ (i) = lim_{N→∞} E[ Σ_{k=0}^{N−1} α^k g(ik , µk (ik ), ik+1 ) | i0 = i ]
= lim_{N→∞} E[ Σ_{k=0}^{K−1} α^k g(ik , µk (ik ), ik+1 ) + Σ_{k=K}^{N−1} α^k g(ik , µk (ik ), ik+1 ) | i0 = i ]
= E[ Σ_{k=0}^{K−1} α^k g(ik , µk (ik ), ik+1 ) | i0 = i ] + lim_{N→∞} E[ Σ_{k=K}^{N−1} α^k g(ik , µk (ik ), ik+1 ) | i0 = i ].

Since |g(ik , µk (ik ), ik+1 )| ≤ M for all ik , ik+1 ∈ S, µk (ik ) ∈ A(ik ),

| lim_{N→∞} E[ Σ_{k=K}^{N−1} α^k g(ik , µk (ik ), ik+1 ) | i0 = i ] | ≤ M Σ_{k=K}^{∞} α^k = α^K M / (1 − α).

Thus,

E[ Σ_{k=0}^{K−1} α^k g(ik , µk (ik ), ik+1 ) | i0 = i ] = Jπ (i) − lim_{N→∞} E[ Σ_{k=K}^{N−1} α^k g(ik , µk (ik ), ik+1 ) | i0 = i ],

and hence

Jπ (i) − α^K M/(1 − α) − α^K max_{j∈S} |J(j)| ≤ E[ Σ_{k=0}^{K−1} α^k g(ik , µk (ik ), ik+1 ) + α^K J(iK ) | i0 = i ]
≤ Jπ (i) + α^K M/(1 − α) + α^K max_{j∈S} |J(j)|.

Taking min over π on all sides, we have for all i ∈ S and K > 0,

J ∗ (i) − α^K M/(1 − α) − α^K max_{j∈S} |J(j)| ≤ (T^K J)(i) ≤ J ∗ (i) + α^K M/(1 − α) + α^K max_{j∈S} |J(j)|. (8)

Letting K → ∞ (all the correction terms vanish since α < 1),

lim_{K→∞} (T^K J)(i) = J ∗ (i). (10)

Corollary 11.3.1. DP Convergence for a given policy:


For every stationary policy µ, the associated cost function satisfies,

Jµ (i) = lim_{N→∞} (Tµ^N J)(i), ∀i ∈ S, J ∈ R|S| . (11)

Proof. Consider an alternative MDP where,

A(i) = {µ(i)} , ∀i ∈ S

Then, using Proposition 11.3 for this MDP,

Jµ (i) = lim_{N→∞} (Tµ^N J)(i), ∀i ∈ S, J ∈ R|S| .

Proposition 11.4. (Bellman Equation:) The optimal cost function J ∗ satisfies,

J ∗ (i) = min_{µ∈A(i)} Σ_{j∈S} Pij (µ) (g(i, µ, j) + αJ ∗ (j)) ∀i ∈ S,

i.e., J ∗ = T J ∗ .

Moreover, J ∗ is the unique solution of this equation within the class of bounded functions.

Proof. Recall eq. (8),

J ∗ (i) − α^K M/(1 − α) − α^K max_{j∈S} |J(j)| ≤ (T^K J)(i) ≤ J ∗ (i) + α^K M/(1 − α) + α^K max_{j∈S} |J(j)|.

Applying T on all sides and using monotonicity (Lemma 11.1) together with the shift rule (Lemma 11.2),

(T J ∗ )(i) − α^{K+1} M/(1 − α) − α^{K+1} max_{j∈S} |J(j)| ≤ (T^{K+1} J)(i) ≤ (T J ∗ )(i) + α^{K+1} M/(1 − α) + α^{K+1} max_{j∈S} |J(j)|.

Letting K → ∞,

lim_{K→∞} (T^{K+1} J)(i) = (T J ∗ )(i).

By Proposition 11.3 the left-hand side equals J ∗ (i). Thus, T J ∗ = J ∗ . To prove uniqueness, suppose Ĵ ∈ Rn is another bounded solution, i.e., Ĵ = T Ĵ. Then Ĵ = T^k Ĵ for every k, so

Ĵ = lim_{k→∞} T^k Ĵ = J ∗ .

This completes the proof.

12 Lecture 12: Infinite Horizon Discounted Cost (Shalabh Bhatnagar)
Scribe: Sahil
Corollary 12.0.1. Bellman Equation for a given policy:
For every stationary policy µ, the associated cost function satisfies,

Jµ (i) = Σ_{j∈S} Pij (µ(i)) (g(i, µ(i), j) + αJµ (j)), ∀i ∈ S.

Moreover, Jµ is the unique solution to this equation within the class of bounded functions.
Proposition 12.1. Necessary and Sufficient Conditions:
A stationary policy µ is optimal if and only if µ(i) attains the minimum in the Bellman equation for every i ∈ S, i.e.

T J ∗ = Tµ J ∗

Proof. Suppose T J ∗ = Tµ J ∗ . Then, using Proposition 11.4,

J ∗ = T J ∗ = Tµ J ∗ ,

so J ∗ is a fixed point of Tµ , and by uniqueness (Corollary 12.0.1), J ∗ = Jµ , i.e., µ is optimal.

Conversely, suppose µ is optimal. Then J ∗ = Jµ , and

T J ∗ = J ∗ = Jµ = Tµ Jµ = Tµ J ∗ .

Thus,

T J ∗ = Tµ J ∗ . (12)

This completes the proof.


Proposition 12.2. For any two bounded functions J : S → R and J ′ : S → R and for all
k = 0, 1, · · · ,
1. ||T k J − T k J ′ ||∞ ≤ αk ||J − J ′ ||∞
2. ||Tµk J − Tµk J ′ ||∞ ≤ αk ||J − J ′ ||∞

Proof. Let C = ||J − J ′ ||∞ = max_{i∈S} |J(i) − J ′ (i)|. Note that,

J(i) − C ≤ J ′ (i) ≤ J(i) + C , ∀i ∈ S.

Applying T^k on all sides and using monotonicity (Lemma 11.1) and the shift rule (Lemma 11.2),

(T^k J)(i) − α^k C ≤ (T^k J ′ )(i) ≤ (T^k J)(i) + α^k C, ∀i ∈ S
⇒ |(T^k J)(i) − (T^k J ′ )(i)| ≤ α^k C, ∀i ∈ S
⇒ max_{i∈S} |(T^k J)(i) − (T^k J ′ )(i)| ≤ α^k ||J − J ′ ||∞
⇒ ||T^k J − T^k J ′ ||∞ ≤ α^k ||J − J ′ ||∞ .

The proof of (2) is identical with Tµ in place of T. This completes the proof.

Corollary 12.2.1. Rate of Convergence of Value Iteration
For any bounded function J : S → R, we have,

max_{i∈S} |(T^k J)(i) − J ∗ (i)| ≤ α^k max_{i∈S} |J(i) − J ∗ (i)| (13)
max_{i∈S} |(Tµ^k J)(i) − Jµ (i)| ≤ α^k max_{i∈S} |J(i) − Jµ (i)| (14)
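The geometric rate in (13) can be observed numerically. The sketch below uses an invented 2-state, 2-action MDP with α = 0.9 and tracks ||T^k J − J ∗||∞ against the bound:

```python
import numpy as np

# Numerical check of ||T^k J - J*||_inf <= alpha^k ||J - J*||_inf on a
# made-up 2-state, 2-action MDP (alpha = 0.9; all numbers are illustrative).
alpha = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
ghat = {0: np.array([1.0, 2.0]), 1: np.array([0.5, 3.0])}

def T(J):
    return np.array([min(ghat[u][i] + alpha * P[u][i] @ J for u in (0, 1))
                     for i in range(2)])

J_star = np.zeros(2)
for _ in range(2000):                 # run T to numerical convergence: J* = T J*
    J_star = T(J_star)

J = np.array([40.0, -7.0])            # arbitrary starting point
e0 = np.abs(J - J_star).max()
errs = []
for _ in range(20):
    J = T(J)
    errs.append(np.abs(J - J_star).max())
```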

Example: A machine can be in one of n states 1, 2, · · · , n where 1 is the best state and n is the
worst state.
• Suppose the transition probabilities Pij are known
• Cost of operating machine for one period is g(i) when state of machine is i.
Now define the action as follows:

action = O (operate machine) or C (replace by a new machine).

Cost incurred when C is chosen is R. Once replaced, the new machine is guaranteed to stay in state 1 for one period. Suppose α ∈ (0, 1) is a given discount factor. The Bellman equation for this system is given by:

J ∗ (i) = min{ R + g(1) + αJ ∗ (1), g(i) + α Σ_{j=1}^{n} Pij J ∗ (j) }.

Then, the optimal policy is: use action C if R + g(1) + αJ ∗ (1) < g(i) + α Σ_{j=1}^{n} Pij J ∗ (j), and use action O otherwise.

Note: Assume,
1. g(1) ≤ g(2) ≤ · · · ≤ g(n)
2. Pij = 0 if j < i
3. Pij ≤ P(i+1)j if i < j
Then,
Σ_{j} Pij J ∗ (j) ≤ Σ_{j} P(i+1)j J ∗ (j)
g(i) + α Σ_{j} Pij J ∗ (j) ≤ g(k) + α Σ_{j} Pkj J ∗ (j), i < k.
Let SR = {i ∈ S | R + g(1) + αJ ∗ (1) ≤ g(i) + α Σ_{j=1}^{n} Pij J ∗ (j)}. Let

i∗ = smallest state in SR if SR is non-empty, and i∗ = n + 1 if SR is empty.

Optimal Policy: Replace Machine if and only if i ≥ i∗ .
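The threshold structure can be verified by value iteration on the Bellman equation above. All numbers below (n = 4, R = 5, α = 0.9, the costs g and the upper-triangular P) are invented, but they satisfy the three assumptions:

```python
import numpy as np

# Machine-replacement example: states 1..n (0-indexed here), actions
# "O" (operate) or "C" (replace). All numbers are invented for illustration.
n, alpha, R = 4, 0.9, 5.0
gcost = np.array([0.0, 1.0, 2.0, 4.0])        # g(1) <= ... <= g(n)
P = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.0, 0.5, 0.3, 0.2],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 1.0]])          # P_ij = 0 for j < i: machine only degrades

J = np.zeros(n)
for _ in range(2000):                          # value iteration on the Bellman equation
    replace = R + gcost[0] + alpha * J[0]      # cost of action C from any state
    operate = gcost + alpha * P @ J            # cost of action O from each state
    J = np.minimum(replace, operate)

policy = np.where(R + gcost[0] + alpha * J[0] < gcost + alpha * P @ J, "C", "O")
```

For these numbers the computed policy is of threshold type: operate in the first two states, replace in the rest.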

Recall that,

(Tµ J)(i) = Σ_{j=1}^{n} Pij (µ(i)) (g(i, µ(i), j) + αJ(j)) , i ∈ S
= Σ_{j=1}^{n} Pij (µ(i)) g(i, µ(i), j) + α Σ_{j=1}^{n} Pij (µ(i)) J(j) , i ∈ S
= ĝ(i, µ(i)) + α Σ_{j=1}^{n} Pij (µ(i)) J(j) , i ∈ S.

Let,

ĝµ = (ĝ(1, µ(1)), . . . , ĝ(n, µ(n)))^T , and Pµ the matrix with (i, j)th entry Pij (µ(i)).

Then, Tµ J = ĝµ + αPµ J.

J µ is the unique fixed point of this equation. Thus,

J µ = ĝµ + αPµ J µ
(I − αPµ )J µ = ĝµ
J µ = (I − αPµ )^{−1} ĝµ .

Note that this closed-form expression is valid only for a fixed stationary policy µ (policy evaluation), not for value iteration.
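In code, policy evaluation is then a single linear solve. The 2-state Pµ and ĝµ below are invented for illustration:

```python
import numpy as np

# Policy evaluation for a fixed stationary policy mu:
# solve (I - alpha P_mu) J^mu = ghat_mu. Numbers are illustrative.
alpha = 0.95
P_mu = np.array([[0.9, 0.1],
                 [0.4, 0.6]])          # row-stochastic transition matrix under mu
g_mu = np.array([1.0, 2.0])            # ghat(i, mu(i))

J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)
```

Since 0 < α < 1 and Pµ is stochastic, the spectral radius of αPµ is below 1, so I − αPµ is always invertible.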

12.1 Value Iteration and Error Bounds


We have shown that starting from any J ∈ Rn ,

lim_{k→∞} (T^k J)(i) = J ∗ (i) , ∀i ∈ S.

Also,

|(T k J)(i) − J ∗ (i)| ≤ αk |J(i) − J ∗ (i)| ∀i ∈ S

Recall that,

J µ (i) = E[ Σ_{k=0}^{∞} α^k g(ik , µ(ik ), ik+1 ) | i0 = i ]
= ĝ(i, µ(i)) + Σ_{k=1}^{∞} α^k E[ g(ik , µ(ik ), ik+1 ) | i0 = i ].

Suppose β = min_i ĝ(i, µ(i)) and β̄ = max_i ĝ(i, µ(i)). Then,

β ≤ ĝ(i, µ(i)) ≤ β̄ ∀i.

In vector notation,

ĝµ + (αβ/(1 − α)) e ≤ J µ ≤ ĝµ + (αβ̄/(1 − α)) e.

Since β e ≤ ĝµ ≤ β̄ e, we have,

(β/(1 − α)) e ≤ ĝµ + (αβ/(1 − α)) e ≤ J µ ≤ ĝµ + (αβ̄/(1 − α)) e ≤ (β̄/(1 − α)) e.
Given a vector J, We know that Tµ J = ĝµ + αPµ J. Subtracting the above from J µ = ĝµ + αPµ J µ ,
we get,

J µ − Tµ J = αPµ (J µ − J)
J µ − J = (Tµ J − J) + αPµ (J µ − J)

Thus, if the cost per stage vector is Tµ J − J, then J µ − J is the corresponding cost to go vector. Then,

(γ/(1 − α)) e ≤ Tµ J − J + (αγ/(1 − α)) e ≤ J µ − J ≤ Tµ J − J + (αγ̄/(1 − α)) e ≤ (γ̄/(1 − α)) e,

where,

γ = min_i [(Tµ J)(i) − J(i)], γ̄ = max_i [(Tµ J)(i) − J(i)].

Proposition 12.3. For every function J : S → R, state i and k ≥ 0,

(T^k J)(i) + Ck ≤ (T^{k+1} J)(i) + Ck+1 ≤ J ∗ (i) ≤ (T^{k+1} J)(i) + C̄k+1 ≤ (T^k J)(i) + C̄k ,

where,

Ck = (α/(1 − α)) min_i [(T^k J)(i) − (T^{k−1} J)(i)],
C̄k = (α/(1 − α)) max_i [(T^k J)(i) − (T^{k−1} J)(i)].

13 Lecture 13: Online Lecture (Shalabh Bhatnagar)
Scribe: Sahil
Topics to cover:
1. Policy Iteration for Discounted Cost
2. Monte Carlo Technique
3. Temporal Difference Learning Algorithms(full state case)
Proposition 13.1. Let µ and µ̄ be two stationary policies such that,

Tµ̄ J µ = T J µ ,

or equivalently,

g(i, µ̄(i)) + α Σ_{j=1}^{n} Pij (µ̄(i)) J µ (j) = min_{u∈A(i)} [ g(i, u) + α Σ_{j=1}^{n} Pij (u) J µ (j) ] , ∀i ∈ [n].

Then, J µ̄ (i) ≤ J µ (i) for all i ∈ [n].

Moreover, if µ is not optimal, strict inequality holds in the above for at least one state i.

Proof. Since J µ = Tµ J µ and by hypothesis Tµ̄ J µ = T J µ , we have for every i,

J µ (i) = g(i, µ(i)) + α Σ_{j=1}^{n} Pij (µ(i)) J µ (j)
≥ g(i, µ̄(i)) + α Σ_{j=1}^{n} Pij (µ̄(i)) J µ (j)
= (Tµ̄ J µ )(i).

Thus,

J µ = Tµ J µ ≥ T J µ = Tµ̄ J µ , i.e., J µ ≥ Tµ̄ J µ .

Applying Tµ̄ repeatedly and using monotonicity,

J µ ≥ Tµ̄ J µ ≥ Tµ̄^2 J µ ≥ · · · ≥ lim_{k→∞} Tµ̄^k J µ = J µ̄
⇒ J µ ≥ J µ̄ .

If J µ̄ = J µ , then,

J µ = J µ̄ = Tµ̄ J µ̄ = Tµ̄ J µ = T J µ = T J µ̄ (since Tµ̄ J µ = T J µ ),
∴ J µ = T J µ and J µ̄ = T J µ̄ .

Since T has a unique fixed point (T is a contraction),

J µ = J µ̄ = J ∗ (the optimal value function),

which implies that µ and µ̄ are both optimal. Thus, if µ is not optimal, there exists at least one i such that

J µ̄ (i) < J µ (i).

This completes our proof.

13.1 Policy Iteration Algorithm


• Step 01: Initialize a stationary policy µ0 .
• Step 02: Policy Evaluation
Given a stationary policy µk , compute the corresponding cost function J µk from the linear system of equations,

(I − αPµk ) J µk = gµk ,

or equivalently solve

J µk = gµk + αPµk J µk = Tµk J µk .

• Step 03: Policy Improvement


Obtain a new stationary policy µk+1 satisfying,

Tµk+1 J µk = T J µk .

If J µk = T J µk , stop; else, go back to Step 2 and repeat the process.
Note here,

gµk = ( Σ_{j=1}^{n} P (1, µk (1), j) g(1, µk (1), j), . . . , Σ_{j=1}^{n} P (n, µk (n), j) g(n, µk (n), j) )^T ,

and Pµk is the matrix whose (i, j)th entry is P (i, µk (i), j).
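The three steps can be sketched as follows, on an invented 2-state, 2-action discounted MDP (α = 0.9, expected one-stage costs ĝ(i, u)):

```python
import numpy as np

# Policy iteration: exact evaluation by a linear solve, then greedy improvement.
# 2-state, 2-action discounted MDP with invented numbers (alpha = 0.9).
alpha = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
ghat = {0: np.array([1.0, 2.0]), 1: np.array([0.5, 3.0])}

mu = np.array([0, 0])                       # Step 1: initial stationary policy
while True:
    # Step 2 (evaluation): solve (I - alpha P_mu) J = g_mu
    P_mu = np.array([P[mu[i]][i] for i in range(2)])
    g_mu = np.array([ghat[mu[i]][i] for i in range(2)])
    J = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)
    # Step 3 (improvement): greedy policy w.r.t. J
    Q = np.array([[ghat[u][i] + alpha * P[u][i] @ J for u in (0, 1)]
                  for i in range(2)])
    mu_new = Q.argmin(axis=1)
    if np.array_equal(mu_new, mu):          # J = T J: stop
        break
    mu = mu_new
```

Each evaluation is exact, and by Proposition 13.1 every improvement step can only lower the cost, so the loop terminates after finitely many iterations.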

13.2 Recap of story
• Basics of RL
• Multi armed Bandits (single state with multiple actions)
1. Greedy strategy
2. ϵ−greedy strategy
3. UCB Exploration
4. Gradient based search
• Markov Decision Process:
we assume knowledge of system model i.e. transition probabilities, reward function etc. In
MDPs, we have covered,
1. Finite Horizon Problems (N < ∞ but deterministic)
2. Stochastic Shortest Path Problems(N < ∞ but random)
3. Discounted Cost Problems(N = ∞)
Algorithms covered:
1. Dynamic Programming Algorithm(Finite Horizon Problems)
2. Bellman Equation(Stochastic Shortest Path Problems and Discounted Cost Problems)
– Value Iteration
– Policy Iteration

13.3 New Story


We shall assume no knowledge of the system model. In return, we will have access to data (in the context of RL, data is nothing but trajectories).

13.4 Monte Carlo Techniques


Recall,

J µ (i) = Eµ [ Σ_{k=0}^{T} r(sk , µ(sk ), sk+1 ) | s0 = i ],

which is the cost to go under policy µ.


We don’t know J µ (i), but we do observe the rewards, and we wish to use Monte Carlo to estimate it. Monte Carlo schemes largely work with sample averages of data collected over trajectories.

The Monte Carlo method can also be written as an update rule:

Vn (s) = (1/n) Σ_{m=1}^{n} Gm , n ≥ 1, when s0 = s.

Then,

Vn+1 (s) = (1/(n + 1)) Σ_{m=1}^{n+1} Gm
= (n/(n + 1)) · (1/n) Σ_{m=1}^{n} Gm + (1/(n + 1)) Gn+1
= (n/(n + 1)) Vn (s) + (1/(n + 1)) Gn+1
= Vn (s) + (1/(n + 1)) (Gn+1 − Vn (s)).
In general, one may let

Vn+1 (s) = Vn (s) + αn (Gn+1 − Vn (s)),

where {αn }n≥0 are step sizes or learning rates such that

Σn αn = ∞, Σn αn^2 < ∞.

Note that as n → ∞,

Vn (s) → Eµ [G | s0 = s] = J µ (s),

where G denotes the return of an episode starting from s.

Online version of this algorithm


Vn+1 (sn ) = Vn (sn ) + αn (Gn+1 − Vn (sn ))
with Vn+1 (s) = Vn (s)∀s ̸= sn
Another way of writing the above,
Vn+1 (s) = Vn (s) + αn I{s=sn } (Gn+1 − Vn (sn ))
where I{s=sn } = 1 if s = sn and 0 otherwise.
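A minimal sketch of this incremental Monte Carlo estimate on an invented one-state episodic chain: from state s we earn reward 1 per stage and terminate each stage with probability p = 0.5, so the true value is E[number of stages] = 1/p = 2:

```python
import random

# Every-visit Monte Carlo with the incremental mean update, on an invented
# one-state episodic chain (terminate each stage w.p. p; true value 1/p = 2).
random.seed(0)
p, V, n = 0.5, 0.0, 0
for _ in range(20000):                 # 20000 episodes, all starting at s
    G = 0.0                            # return of this episode
    while True:
        G += 1.0                       # reward 1 per stage
        if random.random() < p:        # terminate with probability p
            break
    n += 1
    V += (G - V) / n                   # V_n = V_{n-1} + (1/n)(G_n - V_{n-1})
```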

Recall that,
Vn+1 (sn ) = Vn (sn ) + αn (Gn+1 − Vn (sn ))
= Vn (sn ) + αn ( Σ_{j=1}^{N−n} Rn+j − Vn (sn ) )
= Vn (sn ) + αn Σ_{j=1}^{N−n} (Rn+j + Vn (sn+j ) − Vn (sn+j−1 )),

where the last step uses the telescoping sum Σ_{j=1}^{N−n} (Vn (sn+j ) − Vn (sn+j−1 )) = Vn (sN ) − Vn (sn ) = −Vn (sn ), as Vn (sN ) = 0 at the terminal state.
Let dj = Rn+j + Vn (sn+j ) − Vn (sn+j−1 ). These quantities are referred to as temporal difference terms, or temporal errors. Then,

Vn+1 (sn ) = Vn (sn ) + αn Σ_{j=1}^{N−n} dj

Vn+l+1 (sn ) = Vn+l (sn ) + αn dn+l ∀l = 0, 1, · · · , N − n.

14 Lecture 14: Temporal Difference Learning (Shalabh Bhatna-
gar)
This line of work was started by Rich Sutton in his 1984 PhD thesis.
Recall Monte carlo scheme tries to solve for,

Vπ (s) = Eπ (Gn |Sn = s)


PN −n
where Gn = i=1 Rn+i and N is the terminal instant of episode
Monte carlo scheme works with sample average data collected over trajectories.

14.1 Key Idea in TD algorithm


Instead of looking at Vπ (s) = Eπ [Gn |sn = s], we look at bellman equation

Vπ (s) = E[Rn+1 + Vπ (sn+1 )|sn = s]


or,
Eπ [Rn+1 + Vπ (sn+1 ) − Vπ (sn )|sn = s] = 0

The problem is that we do not know this expectation, so we resort to solving it via the TD recursion,

Vn+1 (sn ) = Vn (sn ) + αn dn for n ≥ 0 with (15)


Vn+1 (s) = Vn (s)∀s ̸= sn (16)

Alternatively,

Vn+1 (s) = Vn (s) + αn I{s=sn } (Rn+1 + Vn (sn+1 ) − Vn (sn )),

where I{s=sn } = 1 if s = sn and 0 otherwise.
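A sketch of this recursion on an invented 2-state ergodic chain. Since the chain never terminates, a discount factor γ is added (a discounted variant of the recursion above), so the fixed point Vπ = (I − γPπ )⁻¹ rπ exists and can be computed exactly for comparison:

```python
import numpy as np

# Tabular TD(0) (discounted variant) on an invented 2-state ergodic chain.
rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.7, 0.3], [0.4, 0.6]])     # chain under the fixed policy
r = np.array([1.0, 0.0])                    # expected reward in each state
V_true = np.linalg.solve(np.eye(2) - gamma * P, r)   # V_pi = (I - gamma P)^(-1) r

V, s = np.zeros(2), 0
for n in range(1, 200001):
    s_next = int(rng.choice(2, p=P[s]))
    alpha = 0.5 / (1 + n / 1000)            # decreasing step size
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])   # TD update at visited state
    s = s_next
```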

14.2 Analysis of such system


Suppose the Markov chain {sn } under policy π is ergodic, i.e., irreducible, aperiodic and positive recurrent. Then, starting from any initial distribution, {sn } settles into a unique stationary (steady-state) distribution.
Form a sequence of time points {t(n)} as follows:

t(0) = 0, t(1) = α0 , t(2) = α0 + α1 , · · · ,

with αn > 0 ∀n. The conditions on {αn } are Σn αn = ∞ (which gives t(n) → ∞ as n → ∞) and Σn αn^2 < ∞.

Plot Vn versus t(n); these are discrete points, so interpolate linearly between them. The intuition is to approximate the behaviour of the algorithm by an ODE and study the ODE's asymptotic behaviour. One can show that,

lim_{n→∞} sup_{t∈[Tn ,Tn+1 ]} ||V (t) − V^{Tn} (t)|| = 0, w.p. 1.

Here, V (t), t ≥ 0 is the algorithm's (continuously interpolated) trajectory and V^{Tn} (t), t ∈ [Tn , Tn+1 ] is the ODE trajectory with V^{Tn} (Tn ) = V (Tn ). Suppose the ODE has V ∗ as a globally asymptotically stable equilibrium. Then, the algorithm satisfies Vn → V ∗ almost surely as n → ∞ (under the same conditions).
The ODE corresponding to the TD algorithm is

V̇ (t) = D h(V (t)),

where D = diag(γ(1), . . . , γ(n)) is an n × n diagonal matrix of the (relative) update rates of the states and

h(V ) = ( Σ_{j=1}^{n} Pij (π(i)) (Rπ (i, j) + V (j) − V (i)) )_{i=1,...,n} .

Note that h(Vπ ) = 0, i.e., Vπ is the equilibrium point of this ODE.


References:
1. V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, 2nd edition, 2022 (Chapters 1 and 2 for the ODE approach).

14.3 TD(λ) algorithm


Consider the (l + 1)-step Bellman equation,

Vπ (ik ) = E[ Σ_{m=0}^{l} r(ik+m , ik+m+1 ) + Vπ (ik+l+1 ) ]. (17)

Since the value of l is arbitrary, we can form a weighted average of all such Bellman equations.

Let λ ∈ [0, 1). Since Σ_{l=0}^{∞} (1 − λ)λ^l = 1, we can write the following Bellman equation,
Vπ (ik ) = (1 − λ) Eπ [ Σ_{l=0}^{∞} λ^l ( Σ_{m=0}^{l} r(ik+m , ik+m+1 ) + Vπ (ik+l+1 ) ) ]
= (1 − λ) Eπ [ Σ_{l=0}^{∞} λ^l Σ_{m=0}^{l} r(ik+m , ik+m+1 ) ] + (1 − λ) Eπ [ Σ_{l=0}^{∞} λ^l Vπ (ik+l+1 ) ]
= (1 − λ) Eπ [ Σ_{m=0}^{∞} Σ_{l=m}^{∞} λ^l r(ik+m , ik+m+1 ) ] + Eπ [ Σ_{l=0}^{∞} (λ^l − λ^{l+1}) Vπ (ik+l+1 ) ]
= Eπ [ Σ_{m=0}^{∞} λ^m r(ik+m , ik+m+1 ) ] + Eπ [ Σ_{m=0}^{∞} λ^m (Vπ (ik+m+1 ) − Vπ (ik+m )) ] + Vπ (ik )
= Eπ [ Σ_{m=0}^{∞} λ^m ( r(ik+m , ik+m+1 ) + Vπ (ik+m+1 ) − Vπ (ik+m ) ) ] + Vπ (ik ).

Recall here that, ∀k ≥ N (terminal instant),

ik = 0, r(ik , ik+1 ) = 0, Vπ (ik ) = 0

Let dm = r(im , im+1 ) + Vπ (im+1 ) − Vπ (im ); these are called temporal difference terms. Then, the above reads

Vπ (ik ) = Eπ [ Σ_{m=0}^{∞} λ^m dk+m ] + Vπ (ik )
⇒ 0 = Eπ [ Σ_{m=0}^{∞} λ^m dm+k ].

(Indeed, from the Bellman equation, Eπ [dm ] = 0.)

The stochastic approximation version of this would be:

V (ik ) := V (ik ) + α Σ_{m=k}^{∞} λ^{m−k} dm ,

where dm = r(im , im+1 ) + V (im+1 ) − V (im ) and α is the learning rate. As the number of iterations → ∞, V (ik ) → Vπ (ik ).
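The update above can be implemented online with eligibility traces, which accumulate the λ-weights as the episode unfolds. The chain below is invented: from state 0 we earn reward 1 and move to state 1; from state 1 we earn reward 2 and then terminate w.p. 0.6 or return to state 0 w.p. 0.4, so Vπ (0) = 1 + Vπ (1) and Vπ (1) = 2 + 0.4 Vπ (0), giving Vπ = (5, 4):

```python
import numpy as np

# Episodic TD(lambda) with accumulating eligibility traces on an invented
# two-state chain (true values V(0) = 5, V(1) = 4).
rng = np.random.default_rng(1)
lam = 0.7
V = np.zeros(2)
for ep in range(1, 50001):
    alpha = 1.0 / (1 + ep / 100)      # decreasing step size
    e = np.zeros(2)                    # eligibility traces
    s = 0
    while s is not None:
        if s == 0:
            r, s_next = 1.0, 1
        else:
            r, s_next = 2.0, (None if rng.random() < 0.6 else 0)
        v_next = 0.0 if s_next is None else V[s_next]
        d = r + v_next - V[s]          # temporal difference term d_m
        e[s] += 1.0                    # accumulate trace of the visited state
        V += alpha * d * e             # earlier states credited lam^(m-k) d_m
        e *= lam                       # decay traces (undiscounted episodic case)
        s = s_next
```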

15 Lecture 15: Q Learning (Shalabh Bhatnagar)
Suppose we have access to sampled states j ∼ Pi,· (u) for all i ∈ S and u ∈ A(i). Then, the Q-Learning
algorithm is
Qm+1 (i, u) = Qm (i, u) + γm ( g(i, u, j) + min_{v∈A(j)} Qm (j, v) − Qm (i, u) ),

where j ∼ pi,· (u). The step sizes γm are selected such that Σm γm = ∞ and Σm γm^2 < ∞.
Proposition. Consider the following algorithm:

rt+1 (i) = (1 − γt (i)) rt (i) + γt (i) ((Hrt )(i) + ωt (i)),

where
1. Σt γt (i) = ∞ and Σt γt^2 (i) < ∞.
2. (a) For all i, t, E [ωt (i) | Ft ] = 0 where Ft = σ(rs , s ≤ t, ωs , s < t).
(b) ∃ A, B > 0 such that E [ωt^2 (i) | Ft ] ≤ A + B||rt ||^2 ∀ i, t.
 

3. H : Rn 7→ Rn is a weighted max norm pseudo-contraction, i.e., ∃ r∗ ∈ Rn and a positive


vector E = (E(1), . . . , E(n))T and a constant β ∈ [0, 1) such that

||Hr − r∗ ||E ≤ β||r − r∗ ||E ∀ r ∈ Rn .


Here, ||r||E = max_{i∈S} |r(i)|/E(i). Then, rt → r∗ almost surely as t → ∞, i.e., P (limt→∞ rt = r∗ ) = 1.
We will not prove this general result but use it to show the convergence of the Q-Learning algorithm.
Proposition (Q Learning Convergence) Consider the Q-Learning algorithm:
 
Qt+1 (i, u) = (1 − γt (i, u)) Qt (i, u) + γt (i, u) ( g(i, u, ī) + min_{v∈A(ī)} Qt (ī, v) ),

where ī ∼ Pi,· (u). Let Qt (0, u) = 0 for all u ∈ A(0). Let T^{i,u} denote the set of all times at which Q(i, u) is updated. Let γt (i, u) = 0 for all t ̸∈ T^{i,u} and Σt γt (i, u) = ∞, Σt γt^2 (i, u) < ∞. Then, Qt (i, u) → Q∗ (i, u) almost surely as t → ∞ for all i ∈ S and u ∈ A(i) in both the following cases:
(i) All policies are proper.
(ii) Assumptions (A) and (B) hold.

Proof. Define the mapping H as follows:

(HQ)(i, u) = Σ_{j=1}^{n} pij (u) ( g(i, u, j) + min_{v∈A(j)} Q(j, v) ) , ∀ i ̸= 0, u ∈ A(i).

The Q-Learning algorithm can then be rewritten as

Qt+1 (i, u) = (1 − γt (i, u))Qt (i, u) + γt (i, u) ((HQt )(i, u) + ωt (i, u)) ,

where

ωt (i, u) = g(i, u, ī) + min_{v∈A(ī)} Qt (ī, v) − Σ_{j=1}^{n} pij (u) ( g(i, u, j) + min_{v∈A(j)} Qt (j, v) ).

Observe that E [ωt (i, u) | Ft ] = 0 for all i ∈ S and u ∈ A(i). Furthermore, ∃ a constant k > 0 such that

E [ωt^2 (i, u) | Ft ] ≤ k ( 1 + max_{j∈S, v∈A(j)} |Qt (j, v)| )^2 .

Then, assumption (B) holds. Suppose now that all policies are proper. Then, we have shown that ∃ ξ(i) > 0 for all i ̸= 0 and β ∈ [0, 1) such that

Σ_{j=1}^{n} pij (u) ξ(j) ≤ βξ(i) ∀ i ̸= 0, u ∈ A(i).

Let Q = (Q(i, u), i ∈ S, u ∈ A(i))^T . Define the norm ||Q||ξ = max_{i∈S,u∈A(i)} |Q(i, u)|/ξ(i). Consider two vectors Q and Q̄. Then,

|(HQ)(i, u) − (H Q̄)(i, u)| ≤ Σ_{j=1}^{n} pij (u) | min_{v∈A(j)} Q(j, v) − min_{v∈A(j)} Q̄(j, v) |.

We will prove below that the difference of minima is bounded by the maximum difference:

| min_{v∈A(j)} Q(j, v) − min_{v∈A(j)} Q̄(j, v) | ≤ max_{v∈A(j)} |Q(j, v) − Q̄(j, v)|.

Then,

|(HQ)(i, u) − (H Q̄)(i, u)| ≤ Σ_{j=1}^{n} pij (u) | min_{v∈A(j)} Q(j, v) − min_{v∈A(j)} Q̄(j, v) |
≤ Σ_{j=1}^{n} pij (u) max_{v∈A(j)} |Q(j, v) − Q̄(j, v)|
= Σ_{j=1}^{n} pij (u) ( max_{v∈A(j)} |Q(j, v) − Q̄(j, v)| / ξ(j) ) ξ(j)
≤ Σ_{j=1}^{n} pij (u) ||Q − Q̄||ξ ξ(j)
≤ β ||Q − Q̄||ξ ξ(i)   ( as Σ_{j=1}^{n} pij (u) ξ(j) ≤ βξ(i) ).

Dividing by ξ(i), we get

|(HQ)(i, u) − (H Q̄)(i, u)| / ξ(i) ≤ β||Q − Q̄||ξ .

Since the inequality holds for all i ∈ S and u ∈ A(i), where |S| < ∞, we take the maximum to conclude that

||HQ − H Q̄||ξ ≤ β||Q − Q̄||ξ .

Therefore, H is a weighted max norm pseudo-contraction. The result follows from the general
result.
We will now show that

| min_{v∈A(j)} Q(j, v) − min_{v∈A(j)} Q̄(j, v) | ≤ max_{v∈A(j)} |Q(j, v) − Q̄(j, v)|.

Note that if A ⊂ B, then inf_{x∈A} f (x) ≥ inf_{x∈B} f (x). Therefore,

inf_{x∈A} (f (x) + g(x)) = inf_{x,y∈A, x=y} (f (x) + g(y)) ≥ inf_{x,y∈A} (f (x) + g(y)) = inf_{x∈A} f (x) + inf_{y∈A} g(y).

Now write g = f + (g − f ) and apply the above:

inf_{x∈A} g(x) ≥ inf_{x∈A} f (x) + inf_{x∈A} (g(x) − f (x)) ≥ inf_{x∈A} f (x) − sup_{x∈A} |f (x) − g(x)|,

so that

inf_{x∈A} f (x) − inf_{x∈A} g(x) ≤ sup_{x∈A} |f (x) − g(x)|.

By symmetry (interchanging f and g), the same bound holds for inf_{x∈A} g(x) − inf_{x∈A} f (x). Therefore,

| inf_{x∈A} f (x) − inf_{x∈A} g(x) | ≤ sup_{x∈A} |f (x) − g(x)| .

Now, applying this with A = A(j), f (v) = Q(j, v) and g(v) = Q̄(j, v) (finite sets, so inf = min and sup = max), we have

| min_{v∈A(j)} Q(j, v) − min_{v∈A(j)} Q̄(j, v) | ≤ max_{v∈A(j)} |Q(j, v) − Q̄(j, v)| .

This completes the proof.

Suppose state St is visited at time t. Then, the Q-Learning algorithm in the online setting is
 
Qt+1 (St , At ) = Qt (St , At ) + γt (St , At ) ( g(St , At , St+1 ) + min_{v∈A(St+1 )} Qt (St+1 , v) − Qt (St , At ) ),

where Qt+1 (s, a) = Qt (s, a) for all s ̸= St or a ̸= At .

The question is how do we select At in the update rule? The answer is to select At randomly from
the set A(St ). An alternative way of rewriting the above is

Qt+1 (St , At ) = Qt (St , At ) + γt (St , At ) (g(St , At , St+1 ) + Qt (St+1 , At+1 ) − Qt (St , At )) .

One possibility is
(
arg minv∈A(St ) Qt (St , v) with probability 1 − ϵ,
At =
randomly selected from A(St ) with probability ϵ.

and
At+1 = arg min Qt (St+1 , v).
v∈A(St+1 )

Another update rule is SARSA (State-Action-Reward-State-Action) algorithm. The update rule is


again the same as Q-Learning

Qt+1 (St , At ) = Qt (St , At ) + γt (St , At ) (g(St , At , St+1 ) + Qt (St+1 , At+1 ) − Qt (St , At )) ,

where
(
arg minv∈A(St ) Qt (St , v) with probability 1 − ϵ,
At =
randomly selected from A(St ) with probability ϵ.
(
arg minv∈A(St+1 ) Qt (St+1 , v) with probability 1 − ϵ,
At+1 =
randomly selected from A(St+1 ) with probability ϵ.

They are called the off-policy algorithm (Q-Learning) and the on-policy algorithm (SARSA). Off-
policy algorithms are more popular in practice. Refer to the book by Sutton and Barto for more
details on Double Q-Learning, expected SARSA, etc.
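A sketch of online, off-policy Q-Learning with ϵ-greedy exploration. The 2-state, 2-action MDP is invented; a discount factor is used (so no termination state is needed), the expected one-stage cost ĝ(i, u) stands in for the sampled cost, and a single decreasing step size is shared by all (i, u) pairs for simplicity. The learned Q is compared against the fixed point of H obtained by repeated application:

```python
import numpy as np

# Off-policy Q-Learning with epsilon-greedy behaviour on an invented
# 2-state, 2-action discounted MDP (costs are minimized).
rng = np.random.default_rng(2)
disc = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.5, 0.5], [0.9, 0.1]])}
ghat = {0: np.array([1.0, 2.0]), 1: np.array([0.5, 3.0])}

# Fixed point of (HQ)(i, u) = ghat(i, u) + disc * sum_j P_ij(u) min_v Q(j, v)
Q_star = np.zeros((2, 2))
for _ in range(2000):
    Q_star = np.array([[ghat[u][i] + disc * P[u][i] @ Q_star.min(axis=1)
                        for u in (0, 1)] for i in range(2)])

Q, s, eps = np.zeros((2, 2)), 0, 0.2
for t in range(1, 200001):
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmin())
    s_next = int(rng.choice(2, p=P[a][s]))
    step = 0.5 / (1 + t / 5000)             # decreasing step size
    Q[s, a] += step * (ghat[a][s] + disc * Q[s_next].min() - Q[s, a])
    s = s_next
```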
Professor Gugan will do the function approximation method and the basics of stochastic approxi-
mation algorithms. Whatever he will teach will involve Lipschitz continuity (such as policy gradient
methods), but largely, he will cover function approximation methods.

16 Lecture 16
Scribe: Rohit
(Notes for this lecture are yet to be added.)
17 Lecture 17: Application Of Stochastic Approximation To RL
Given a µ, our goal is to approximate Jµ given by:

Jµ (s) = E[ Σ_{t=0}^{∞} γ^t r(st , at ) | s0 = s ].

The Bellman analogue is:

Jµ (s) = E_{a0 ,s′} [ r(s0 , a0 ) + γJµ (s′ ) | s0 = s ]
Jµ (s) = Σ_{a,s′} µ(a|s) P(s′ |s, a) ( r(s, a) + γJµ (s′ ) ).

In this setting, our goal is to find Jµ ; we have to solve this system of |S| linear equations in |S| unknowns (assuming we know the probabilities P and the policy µ).
In the model-free setup, we don’t know P. We try to exploit laws of large numbers here. After a few runs, by the SLLN,

(X1 + X2 + · · · + Xn )/n → E[X] a.s.

Jµ is an infinite sum. How do we get samples? We call one run till termination, (s0 , a0 , r(s0 , a0 ), s1 , a1 , r(s1 , a1 ), . . . , sT ), one sample, and calculate Σ_{t=0}^{T−1} γ^t r(st , at ) + γ^T r(sT ).

For each s, we collect k such samples with s0 = s. Call them C1 (s), C2 (s), . . . , Ck (s). Then,

Jµ (s) ≈ (C1 (s) + C2 (s) + · · · + Ck (s))/k.

With this naive approach, both space and time grow linearly with n. This approach is non-incremental, i.e., you don’t reuse your samples. Let

x̄n = (X1 + X2 + · · · + Xn )/n.
We can rewrite this as:

x̄n = ((n − 1)x̄n−1 + Xn )/n = x̄n−1 + (1/n)(Xn − x̄n−1 ).

More generally,

x̄n = x̄n−1 + αn (Xn − x̄n−1 ).

This does not have linear space growth and is incremental.
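The incremental form in one line of code (the samples are arbitrary):

```python
# Incremental sample mean: stores only the current estimate and the count.
xs = [2.0, 4.0, 9.0, 1.0, 4.0]         # arbitrary samples
mean = 0.0
for n, x in enumerate(xs, start=1):
    mean += (x - mean) / n             # x_n = x_{n-1} + (1/n)(X_n - x_{n-1})
```

Only the running estimate and the count are stored, yet the result equals the full sample mean exactly.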


Define an optimization problem:

f (x) = (1/2)(x − E[X])^2 , with ∇f (x) = x − E[X].

Let’s do gradient descent:

xn+1 = xn + αn (−∇f (xn )) = xn + αn (E[X] − xn ).

Replacing E[X] by the sample Xn recovers the incremental update obtained above, so what we got before is stochastic gradient descent.
So we can write:
xn = xn−1 + αn (Xn − xn−1 )
Ĵµ^n (s) = Ĵµ^{n−1} (s) + αn (Cn − Ĵµ^{n−1} (s)).

17.1 TD Algorithm

xn+1 = xn + αn (r(sn , an ) + γxn (sn+1 ) − xn (sn )) esn

Here, xn+1 , xn ∈ R^{|S|} and es denotes the s th standard basis vector.

Start with arbitrary samples (s0 , a0 , r(s0 , a0 ), s1 ), (s1 , a1 , r(s1 , a1 ), s2 ), . . . . We can show xn → Jµ under some conditions.
Let us define another optimization problem.

f (x) = (1/2) ||Jµ − x||^2_D = (1/2) Σs d(s) [Jµ (s) − x(s)]^2

xn+1 = xn + αn (−∇f (xn )) = xn + αn Σs d(s) [Jµ (s) − xn (s)] es

Here there are no samples involved. We do not know Jµ (s).


We replace Jµ (s) by the Bellman equation:

xn+1 = xn + αn Σ_{s,a,s′} d(s) µ(a|s) P(s′ |s, a) [r(s, a) + γJµ (s′ ) − xn (s)] es

We do not know Jµ (s′ ) (an infinite sum), so we substitute it with xn (s′ ). But now this can no longer be viewed as the gradient of the earlier objective function.
We can view d(s)µ(a|s)P(s′ |s, a) as the distribution of (s, a, s′ ).

xn+1 = xn + αn (r(sn , an ) + γxn (s′n ) − xn (sn ))esn

where,
sn ∼ d, an ∼ µ(.|sn ), s′n ∼ P(.|sn , an )

This is the TD-0 algorithm. Assume we can sample from d; this is model-free. We do not know P
but we should be able to sample from d.

If we allow a Markov chain to evolve, then we get the stationary distribution, which is d.
The algorithm we obtained at the end is slightly different from the previous one:
Previous algorithm (TD(0) with Markov sampling):

(s0 , a0 , r(s0 , a0 ), s1 ), (s1 , a1 , r(s1 , a1 ), s2 )

TD(0) algorithm with fresh sampling:

(s0 , a0 , r(s0 , a0 ), s′0 ), (s1 , a1 , r(s1 , a1 ), s′1 )

This cannot be expressed as the gradient of any function, so such schemes are studied under stochastic approximation algorithms.

18 Lecture 18: Temporal Difference Learning and Function Approximation
18.1 Markov Decision Processes (MDP)
To describe an MDP, we need the tuple:

(S, A, P, r, γ)

For a Markov Chain (MC), we only need:

(S, P )

The transition probabilities differ in the MDP setting. We denote the MC transition probability as:

Pµ (s′ |s) = Σa µ(a|s) P (s′ |s, a)

If we start at some arbitrary state s0 and allow the Markov chain to evolve, after n steps we reach
sn . If the Markov chain is well-behaved, it converges to a stationary distribution dµ , independent
of n, satisfying:
dTµ Pµ = dTµ

18.2 Temporal Difference (TD) Learning


We analyze TD algorithms, transitioning from the tabular case to a function approximation setting.

18.2.1 Linear Function Approximation


We focus on linear function approximation, where another common approach is using neural net-
works.
Linear Function Approximation:

Φ ∈ RS×d , d≪S

x ∈ col(Φ)
New goal: Find θ∗ such that:
Jµ ≈ Φθ∗

18.2.2 Objective Function and Update Rule


We define an objective function:

f (θ) = (1/2) ∥Φθ − Jµ ∥^2_{Dµ}

where Dµ = diag(dµ ) is a positive definite matrix. Expanding:

f (θ) = Σs (1/2) dµ (s) (Φ^T (s)θ − Jµ (s))^2

Gradient descent update:
θn+1 = θn + αn [−∇f (θn )]

Computing the gradient:

∇f (θ) = Σs dµ (s) (Φ^T (s)θ − Jµ (s)) Φ(s)

Update rule:

θn+1 = θn + αn Σs dµ (s) (Jµ (s) − Φ^T (s)θn ) Φ(s)

We don’t know Jµ (s), so let’s use the Bellman equation. We get,

θn+1 = θn + αn Σ_{s,a,s′} dµ (s) µ(a|s) P (s′ |s, a) ( r(s, a) + γJµ (s′ ) − Φ^T (s)θn ) Φ(s)

18.3 TD(0) with Linear Function Approximation


Our new algorithm is given by:

θn+1 = θn + αn ( r(sn , an ) + γΦ^T (s′n )θn − Φ^T (sn )θn ) Φ(sn )

where sn ∼ dµ (·), an ∼ µ(·|sn ), s′n ∼ P (·|sn , an ), and the triples (sn , an , s′n ) are i.i.d.

This is TD(0) with linear function approximation.

In tabular TD(0), the whole operation happens in an s-dimensional space, whereas here it happens
in a d-dimensional space. This reduces time complexity.
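A sketch of this d-dimensional update with fresh i.i.d. sampling, on an invented 3-state chain compressed to d = 2 features. The comparison point θ∗ = A⁻¹b (with A = Φ^T Dµ (I − γPµ )Φ and b = Φ^T Dµ rµ , as derived later in this lecture) is computed directly:

```python
import numpy as np

# TD(0) with linear function approximation, fresh sampling (s ~ d, s' ~ P).
# Chain, rewards and features below are invented for illustration.
rng = np.random.default_rng(3)
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
r = np.array([1.0, 0.0, 2.0])          # expected reward per state
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])            # full-column-rank features, d = 2

# Stationary distribution d: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d = d / d.sum()
D = np.diag(d)

A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
b = Phi.T @ D @ r
theta_star = np.linalg.solve(A, b)      # the TD(0) fixed point

theta = np.zeros(2)
for n in range(1, 200001):
    s = int(rng.choice(3, p=d))                 # s_n ~ d (fresh sampling)
    s2 = int(rng.choice(3, p=P[s]))             # s'_n ~ P(.|s_n)
    alpha = 0.25 / (1 + n / 1000)
    delta = r[s] + gamma * Phi[s2] @ theta - Phi[s] @ theta
    theta += alpha * delta * Phi[s]
```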

We want to write it in the form of a stochastic approximation algorithm:

θn+1 = θn + αn [h(θn ) + Mn+1 ]

Let us define a σ-field first:

Fn = σ θ0 , s0 , a0 , r(s0 , a0 ), s1 , a1 , r(s1 , a1 ), s2 , . . . , sn−1 , an−1 , r(sn−1 , an−1 ), s′n




Since we have this sigma-field, we know that:

θ0 , θ1 , . . . , θn ∈ Fn

which means they are measurable with respect to Fn .

Now, let:
δn = r(sn , an ) + γΦT (s′n )θn − ΦT (sn )θn

We don’t know r(sn , an ), s′n , or sn , an , but we know θn , if we have Fn .

We define:
h(θn ) = E [δn Φ(sn ) | Fn ]

Expanding it:

h(θn ) = E[ r(sn , an )Φ(sn ) + γ Φ^T (s′n )θn Φ(sn ) − Φ^T (sn )θn Φ(sn ) | Fn ]

By the linearity of conditional expectation:

= E[r(sn , an )Φ(sn ) | Fn ] + γ E[Φ(sn )Φ^T (s′n ) | Fn ] θn − E[Φ(sn )Φ^T (sn ) | Fn ] θn

Since (sn , an , s′n ) is sampled fresh, it is independent of Fn , so the conditioning drops:

= E[r(sn , an )Φ(sn )] + γ E[Φ(sn )Φ^T (s′n )] θn − E[Φ(sn )Φ^T (sn )] θn

Writing the expectations as sums (e.g., E[r(sn , an )Φ(sn )] = Σ_{s,a} dµ (s)µ(a|s)r(s, a)Φ(s)), we finally get:

h(θn ) = b − Aθn , where b = Φ^T Dµ rµ and A = Φ^T Dµ (I − γPµ )Φ.

Thus, our update rule is:

θn+1 = θn + αn [h(θn ) + Mn+1 ]

We define the noise term:

Mn+1 = δn Φ(sn ) − h(θn ),

which satisfies E[Mn+1 |Fn ] = E[δn Φ(sn )|Fn ] − h(θn ) = 0 (since θn , and hence h(θn ), is Fn -measurable), with

h(θ) = b − Aθ, b = Φ^T Dµ rµ , A = Φ^T Dµ (I − γPµ )Φ.

so,
θn+1 = θn + αn [b − Aθn + Mn+1 ]
This can be viewed as a noisy Euler approximation of:

θ(t2 ) = θ(t1 ) + (t2 − t1 )h(θ(t1 ))

19 Lecture 19 (Gugan Thoppe)
Recall that we assumed that the feature matrix Φ is given to us. We want to minimize
f (θ) = (1/2) ||Jµ − Φθ||^2 .
We came up with the update rule
θn+1 = θn + αn ( r(sn , an ) + γΦ^T (s′n )θn − Φ^T (sn )θn ) Φ(sn ).


The algorithm can be written as


θn+1 = θn + αn (b − Aθn + Mn+1 ) ,
where
b = ΦT Dµ rµ
A = ΦT Dµ (I − γPµ )Φ
Mn+1 = δn Φ(sn ) − (b − Aθn ).
Further, E [Mn+1 | Fn ] = 0. In this lecture, we will focus on what the algorithm would do if we knew b and A, followed by convincing ourselves that the noisy and noiseless versions of the algorithm behave the way we want.
First we will analyze the noiseless version of the algorithm.
• The behaviour of this algorithm is governed by the ODE
θ′ (t) = h(θ(t)) = b − Aθ(t).
This can be justified as

θ(t2 ) − θ(t1 ) = ∫_{t1}^{t2} θ′ (t) dt = ∫_{t1}^{t2} h(θ(t)) dt ≈ h(θ(t1 ))(t2 − t1 ).
Hence, θ(t2 ) ≈ θ(t1 )+h(θ(t1 ))(t2 −t1 ). This is the Euler’s method. The ODE is a continuous-
time version of the algorithm.
θ(t2 ) ≈ θ(t1 ) + (t2 − t1 )(b − Aθ(t1 )).
Hence, the algorithm can be viewed as noisy Euler’s method (noisy due to the term Mn+1 ).
n→∞
The gap between Euler’s method and the algorithm goes to zero as αn −−−→ 0.
• Under some nice conditions, the limiting behaviour of a stochastic approximation algorithm
matches the behaviour of a deterministic ODE.
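As a quick sanity check on this Euler picture, the sketch below iterates the noiseless recursion θn+1 = θn + αn(b − Aθn) and checks that it reaches the equilibrium A⁻¹b. The particular positive definite A and vector b are arbitrary illustrative values:

```python
import numpy as np

# Illustrative positive definite A and vector b (not from the notes).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A, b)   # equilibrium of theta'(t) = b - A theta(t)

# Noiseless Euler iteration with decaying step sizes.
theta = np.zeros(2)
for n in range(5000):
    alpha = 1.0 / (n + 1)            # alpha_n -> 0, sum of alpha_n = infinity
    theta = theta + alpha * (b - A @ theta)
```

Since A is positive definite, the iterates track the ODE trajectory and converge to θ∗′ = A⁻¹b.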
Let us now analyze this ODE:

θ′(t) = b − Aθ(t).

The equilibrium point is θ∗′ = A−1 b (we will see why A is invertible). We will answer two questions in the rest of this lecture:
• Is θ∗′ asymptotically stable (i.e., does θ(t) → θ∗′ as t → ∞)?
• Is there any connection between θ∗ and θ∗′?

19.1 Is θ∗′ asymptotically stable?
19.1.1 Lyapunov functions
Let V : Rd → R be given by

V(θ) = (1/2)||θ − θ∗′||².

If ∇V(θ)T h(θ) < 0 for all θ ≠ θ∗′, then

dV(θ(t))/dt = ∇V(θ(t))T θ′(t) = ∇V(θ(t))T h(θ(t)) < 0

whenever θ(t) ≠ θ∗′. In this case any trajectory starting from θ(0) ≠ θ∗′ will converge to θ∗′. The function V is called a Lyapunov function. Now let us verify this condition.

∇V(θ) = θ − θ∗′,
∇V(θ)T h(θ) = (θ − θ∗′)T (b − Aθ) = −(θ − θ∗′)T A(θ − θ∗′)   (using b = Aθ∗′, which assumes A is invertible).

If we can show that A is positive definite then it is invertible and the above expression is negative.
Lemma θT Aθ > 0 for all θ ∈ Rd \ {0}.

Proof. Recall that A = ΦT Dµ (I − γPµ )Φ.

θT Aθ = θT ΦT Dµ (I − γPµ )Φθ
= y T Dµ (I − γPµ )y where y = Φθ.

We will assume henceforth that Φ has full column rank. Then, it suffices to show that B = Dµ (I − γPµ) is positive definite, i.e., for all y ≠ 0,

yT Dµ y − γ yT Dµ Pµ y > 0.

Claim 1. yT Dµ y ≥ yT Dµ Pµ y for all y ∈ Rn. This implies that

yT Dµ y − γ yT Dµ Pµ y ≥ yT Dµ y − γ yT Dµ y = (1 − γ) yT Dµ y > 0.

Here we further assume that the stationary distribution dµ has all entries positive (which holds if the chain is irreducible), so that yT Dµ y > 0 for all y ≠ 0.

Proof. By the Cauchy–Schwarz inequality,

yT Dµ Pµ y = (Dµ^{1/2} y)T (Dµ^{1/2} Pµ y) ≤ ||Dµ^{1/2} y|| · ||Dµ^{1/2} Pµ y|| = ||y||Dµ · ||Pµ y||Dµ.

Combined with Claim 2 below, this gives yT Dµ Pµ y ≤ ||y||Dµ² = yT Dµ y.

Claim 2. ||Pµ y||Dµ² ≤ ||y||Dµ².

Proof. The left-hand side is

||Pµ y||Dµ² = Σs Dµ(s) (Pµ(s, ·)y)².

Recall Jensen's inequality: for any convex function f, f(E[X]) ≤ E[f(X)]. Taking f(x) = x² and, for each fixed s, averaging over s′ ∼ Pµ(s, ·) gives

(Pµ(s, ·)y)² = (Σs′ Pµ(s, s′) y(s′))² ≤ Σs′ Pµ(s, s′) y(s′)².

Summing over s with weights Dµ(s) and using stationarity (Σs Dµ(s)Pµ(s, s′) = Dµ(s′)), we get

||Pµ y||Dµ² ≤ Σs′ Dµ(s′) y(s′)² = ||y||Dµ².

This completes the proof.
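The chain of claims above can be spot-checked numerically on a randomly generated chain. In this sketch the number of states, the random Pµ and Φ, and the seed are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, d, gamma = 5, 2, 0.9

# Random irreducible transition matrix P_mu (all entries positive).
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
dist = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
dist /= dist.sum()
D = np.diag(dist)

Phi = rng.random((n_states, d))          # full column rank with probability 1
A = Phi.T @ D @ (np.eye(n_states) - gamma * P) @ Phi

def norm_D(v):
    """Weighted norm ||v||_D = sqrt(v^T D v)."""
    return np.sqrt(v @ D @ v)

# Claim 2: ||P_mu y||_D <= ||y||_D for an arbitrary y.
y = rng.standard_normal(n_states)
print(norm_D(P @ y) <= norm_D(y))        # True

# Lemma: theta^T A theta > 0 for theta != 0 (check on a random theta).
theta = rng.standard_normal(d)
print(theta @ A @ theta > 0)             # True
```

A stronger check is that the symmetric part of A has strictly positive eigenvalues, which is equivalent to the quadratic form being positive.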

One would expect that the noisy algorithm would behave in a similar manner, but before that, should we be excited about θ∗′?

Lemma. Recall θ∗′ = A−1 b, where A = ΦT Dµ (I − γPµ)Φ and b = ΦT Dµ rµ. Then

ΠTµ (Φθ∗′) = Φθ∗′,

i.e., Φθ∗′ satisfies the projected Bellman equation and is the fixed point of the projected Bellman operator ΠTµ, where Π = Φ(ΦT Dµ Φ)−1 ΦT Dµ.

Proof. Left as an exercise.

The geometrical interpretation of the above lemma is that Φθ∗′ is the point in the column space of Φ that is left unchanged by applying the Bellman operator Tµ and then projecting back onto that column space. Here Π is the projection operator in the Dµ norm: for any J,

(1/2)||J − ΠJ||Dµ² = min over θ of (1/2)||J − Φθ||Dµ².

The projected Bellman operator ΠTµ is a contraction in the Dµ norm, so Φθ∗′ is its unique fixed point. Note, however, that Φθ∗′ is in general different from ΠJµ, the projection of Jµ itself.
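The lemma itself (left as an exercise) can at least be verified numerically: solve Aθ∗′ = b on a random instance and check the projected Bellman equation. All concrete numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, gamma = 4, 2, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # P_mu
r = rng.random(n)                                           # r_mu
evals, evecs = np.linalg.eig(P.T)
dist = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
dist /= dist.sum()
D = np.diag(dist)
Phi = rng.random((n, d))

A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi
b = Phi.T @ D @ r
theta_p = np.linalg.solve(A, b)                           # theta*' = A^{-1} b

Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)    # projection operator
T = lambda J: r + gamma * P @ J                           # Bellman operator T_mu

# Fixed-point check: Pi T_mu (Phi theta*') == Phi theta*'.
print(np.allclose(Pi @ T(Phi @ theta_p), Phi @ theta_p))  # True
```

The check succeeds exactly, since ΠTµ(Φθ′) = Φθ′ unfolds algebraically to ΦT Dµ (rµ + γPµΦθ′ − Φθ′) = 0, i.e., b − Aθ′ = 0.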

20 Lecture 20 (Gugan Thoppe)
Recall

f(θ) = (1/2)||Jµ − Φθ||Dµ² = (1/2) Σs Dµ(s) (Jµ(s) − Φ(s)T θ)².

The gradient of f is

∇f(θ) = − Σs Dµ(s) (Jµ(s) − Φ(s)T θ) Φ(s).

The Hessian is

∇²f(θ) = Σs Dµ(s) Φ(s)Φ(s)T = ΦT Dµ Φ,

which is positive definite since Φ has full column rank and Dµ has positive diagonal entries.
Thus, the optimal θ∗ satisfies ∇f(θ∗) = 0, and θ∗ is the unique minimizer of f, given by

θ∗ = (ΦT Dµ Φ)−1 ΦT Dµ Jµ.

Thus,

Φθ∗ = Φ(ΦT Dµ Φ)−1 ΦT Dµ Jµ.

This Φθ∗ is the closest point in the column space of Φ to Jµ in the Dµ norm. We call Π = Φ(ΦT Dµ Φ)−1 ΦT Dµ the projection operator, so that Φθ∗ = ΠJµ.
Φθ∗′ is the fixed point of the equation ΠTµ Φθ = Φθ. Now we wish to compare ||Jµ − Φθ∗′||Dµ with ||Jµ − Φθ∗||Dµ. From the definition of θ∗, we get

||Jµ − Φθ∗||Dµ ≤ ||Jµ − Φθ∗′||Dµ.

Now, by the triangle inequality and the fixed-point property Φθ∗′ = ΠTµ Φθ∗′,

||Jµ − Φθ∗′||Dµ ≤ ||Jµ − ΠJµ||Dµ + ||ΠJµ − Φθ∗′||Dµ
= ||Jµ − ΠJµ||Dµ + ||ΠJµ − ΠTµ Φθ∗′||Dµ.

Claim. The projection Π is non-expansive: ||ΠV − ΠV′||Dµ ≤ ||V − V′||Dµ.

Using the above claim and Jµ = Tµ Jµ, we get

||Jµ − Φθ∗′||Dµ ≤ ||Jµ − ΠJµ||Dµ + ||Jµ − Tµ Φθ∗′||Dµ
= ||Jµ − ΠJµ||Dµ + ||Tµ Jµ − Tµ Φθ∗′||Dµ.

Recall from Claim 2 of the previous lecture that ||Pµ y||Dµ ≤ ||y||Dµ; hence Tµ is a γ-contraction in the Dµ norm:

||Tµ J − Tµ J′||Dµ = γ||Pµ (J − J′)||Dµ ≤ γ||J − J′||Dµ.

Thus, ||Tµ Jµ − Tµ Φθ∗′||Dµ ≤ γ||Jµ − Φθ∗′||Dµ. Therefore,

||Jµ − Φθ∗′||Dµ ≤ ||Jµ − ΠJµ||Dµ + γ||Jµ − Φθ∗′||Dµ.

Hence,

||Jµ − Φθ∗′||Dµ ≤ (1/(1 − γ)) ||Jµ − ΠJµ||Dµ.

If γ is close to 1, the gap between ||Jµ − Φθ∗′||Dµ and ||Jµ − ΠJµ||Dµ can be very large.
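Both the ordering ||Jµ − Φθ∗||Dµ ≤ ||Jµ − Φθ∗′||Dµ and the 1/(1 − γ) bound can be checked on a small random instance; recall that Φθ∗ = ΠJµ here. The instance below (sizes, seed, random chain, features) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, gamma = 6, 2, 0.95
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)
evals, evecs = np.linalg.eig(P.T)
dist = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); dist /= dist.sum()
D = np.diag(dist)
Phi = rng.random((n, d))

J = np.linalg.solve(np.eye(n) - gamma * P, r)           # J_mu = (I - gamma P)^{-1} r
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)  # projection operator
A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi
b = Phi.T @ D @ r
theta_p = np.linalg.solve(A, b)                         # TD fixed point theta*'

norm_D = lambda v: np.sqrt(v @ D @ v)
err_proj = norm_D(J - Pi @ J)          # ||J_mu - Pi J_mu||_D = ||J_mu - Phi theta*||_D
err_td = norm_D(J - Phi @ theta_p)     # ||J_mu - Phi theta*'||_D

print(err_proj <= err_td + 1e-12)                  # True: projection error is smallest
print(err_td <= err_proj / (1 - gamma) + 1e-12)    # True: the 1/(1-gamma) bound
```

Both inequalities hold by the argument above; the numerical check simply confirms the algebra.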
Claim (θn )n≥0 generated using the noisy algorithm

θn+1 = θn + αn (b − Aθn + Mn+1 )

converges almost surely to θ∗′ .

Proof. We will verify the four assumptions of the result proved by Michel Benaïm in 1996 (see the second chapter of Borkar).
(A1) Let h(x) = b − Ax. Then, h is Lipschitz continuous with constant L = ||A|| as

||h(x) − h(y)|| = ||A(x − y)|| ≤ ||A|| · ||x − y||.

(A2) Choose step sizes αn such that Σn αn = ∞ and Σn αn² < ∞, for example αn = 1/n^p where p ∈ (0.5, 1]. In practice, step sizes are often kept constant for some time and then decreased to zero to obtain a faster convergence rate. If the step size is kept constant, the noise term will not go to zero; if it is decreased too fast, the rate of convergence will be very slow. The choice of step size is a trade-off between these two and is domain specific.
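This trade-off can be seen on a scalar toy problem θn+1 = θn + αn(1 − θn + noise), whose target point is θ = 1. The constants, schedules, and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100_000

def run(step):
    # Iterate theta_{n+1} = theta_n + alpha_n * (1 - theta_n + M_{n+1}),
    # a scalar stochastic approximation with h(theta) = 1 - theta.
    theta = 0.0
    for n in range(N):
        theta += step(n) * (1.0 - theta + rng.standard_normal())
    return theta

theta_const = run(lambda n: 0.1)              # constant step: noise never dies out
theta_decay = run(lambda n: (n + 1) ** -0.7)  # satisfies (A2): noise is averaged out
print(abs(theta_decay - 1.0))                 # small; theta_const fluctuates more
```

With the constant step the iterate keeps fluctuating around 1 with a variance proportional to the step size, while the decaying schedule drives the error to zero (at a rate that would be slower still for faster-decaying steps).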
(A3) Let (Mn)n≥1 be a square-integrable martingale difference sequence with respect to the filtration (Fn)n≥1, i.e.,

E||Mn||² < ∞,  E[Mn+1 | Fn] = 0.

Furthermore,

E[||Mn+1||² | Fn] ≤ K(1 + ||θn||²) for some constant K > 0.

The assumption E[Mn+1 | Fn] = 0 follows as Mn+1 = δn Φ(sn) − (b − Aθn) and E[δn Φ(sn) | Fn] = b − Aθn. We will now verify the conditional bound in the third assumption. Write

Mn+1 = δn Φ(sn) − (b − Aθn),

and decompose the first term as

δn Φ(sn) = r(sn, an)Φ(sn) + (γΦ(sn)ΦT(s′n) − Φ(sn)ΦT(sn)) θn.

Assume all rewards are bounded, i.e., |r(sn, an)| ≤ Rmax, and that the features satisfy ||Φ(s)|| ≤ 1 (wlog, as we can always normalize the features). Then ||r(sn, an)Φ(sn)|| ≤ Rmax, and since b = E[r(sn, an)Φ(sn) | Fn], also ||b|| ≤ Rmax. Further,

A = E[Φ(sn)ΦT(sn) − γΦ(sn)ΦT(s′n) | Fn],

and ||γΦ(sn)ΦT(s′n)|| ≤ γ||Φ(sn)|| · ||Φ(s′n)|| ≤ γ, so ||A|| ≤ 1 + γ. Putting these together,

||Mn+1|| ≤ ||δn Φ(sn)|| + ||b|| + ||A|| · ||θn|| ≤ 2Rmax + 2(1 + γ)||θn||,

and therefore

||Mn+1||² ≤ 8 max{Rmax², (1 + γ)²} (1 + ||θn||²).
We will show (A4) and part (a) of (A3) in the next lecture.

