
Lecture 5: Markov Decision Processes

Akshay Krishnamurthy
[email protected]
February 21, 2022

1 Recap and Introduction


So far, we have been discussing simple exploration problems in various “bandit” formulations. In all of these
problems the agent chooses actions which influence the immediate reward but, in the bandit protocols, these actions
have no further influence on future interactions. This means that our actions have no long-term consequences, or
equivalently, credit assignment is relatively straightforward. However, many sequential decision making scenarios
cannot be effectively modeled as bandit problems because actions do have long-term consequences. Developing
algorithms for these settings requires introducing new interaction protocols/models.
While we have seen relatively sophisticated algorithms for bandit problems, such as LinUCB and SquareCB, to
incorporate credit assignment, we’ll have to take a step back and simplify considerably. In particular, today we’ll
think only about the credit assignment capability, in the absence of both exploration and generalization.

2 Markov Decision Processes


A Markov decision process formalizes a decision making problem with state that evolves as a consequence of the agent's actions. The schematic is displayed in Figure 1.

Figure 1: A schematic of a Markov decision process, showing states s_0, s_1, s_2, s_3, the actions a_0, a_1, a_2 taken in them, and the resulting rewards r_0, r_1, r_2.

Here the basic objects are:


• A state space S, which could be finite or infinite. For now let us think of this as finite and relatively small.
• An action space A, which could also be finite or infinite. Again let’s assume this is small for now.
• A reward function R : S × A → ∆([0, 1]) that associates a distribution over rewards to each state-action pair.
We’ll say that r ∼ R(s, a) and also that r(s, a) := E[r | s, a] which is a slight abuse of notation.
• A transition operator P : S × A → ∆(S) that associates a distribution over next states to each state-action
pair. As with the reward, we'll say that s' ∼ P(s, a) to denote a sample drawn from this operator.
• An initial state distribution µ ∈ ∆(S) that describes how the initial state s_0 will be chosen.
A key property is that the dynamics are Markovian, meaning that the distribution of the next state s' and reward r depend only on the most recent state s and action a. With these basic objects there are many formulations.
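To make these objects concrete, here is a minimal sketch of how a small tabular MDP might be stored in code (Python with numpy); the names S, A, P, R, mu and the helper step are illustrative, not from the notes.

import numpy as np

# A minimal, illustrative tabular MDP.
# P[s, a] is a distribution over next states, R[s, a] is the mean reward r(s, a),
# and mu is the initial state distribution.
S, A = 3, 2
rng = np.random.default_rng(0)

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)    # normalize each P[s, a] to sum to 1
R = rng.random((S, A))               # mean rewards in [0, 1]
mu = np.ones(S) / S                  # uniform initial state distribution

def step(s, a):
    """Sample one transition: r ~ R(s, a) (here Bernoulli) and s' ~ P(s, a)."""
    r = float(rng.random() < R[s, a])
    s_next = int(rng.choice(S, p=P[s, a]))
    return r, s_next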

3 Finite horizon, episodic setting
Let us start with the simpler finite horizon setting. In this setting, there is also a horizon H ∈ N that describes how
long an episode lasts. With horizon H, an episode produces a trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . . , s_{H−1}, a_{H−1}, r_{H−1}) in the following manner: s_0 ∼ µ, s_h ∼ P(s_{h−1}, a_{h−1}) and r_h ∼ R(s_h, a_h) for each h, and all actions a_{0:H−1} are chosen by the agent. The Markov property simply means that (s_h, r_h) ⊥ (s_{0:h−2}, a_{0:h−2}, r_{0:h−2}) | (s_{h−1}, a_{h−1}), so the entire past is summarized in the current state. Note however that we can choose actions in a non-Markovian manner.

3.1 Policies, Value functions, and Objective


To set up the objective, we must introduce the concept of a policy. In general, a policy is a decision making strategy that makes a (possibly randomized) decision based on the current trajectory, that is π : ℋ → ∆(A) where ℋ is the set of all partial trajectories. A non-stationary Markov policy instead compresses the history to the current state and time-step only, that is π : S × [H] → ∆(A). For such policies, we use the notation π_h(s) ∈ ∆(A) to denote the action distribution on state s and time step h. If the policy is deterministic, then the range is just A.

Objective. Every policy π has a value, which is the total reward we expect to accumulate if we deploy policy π.
This is defined as
"H−1 #
X
J(π) := E rh | s0 ∼ µ, a0:H−1 ∼ π
h=0

The objective in an MDP is to find a policy π that maximizes J(π).
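The definition of J(π) suggests a simple (if inefficient) way to estimate it: roll out episodes and average the total reward. A hedged sketch, reusing the illustrative tabular MDP above and assuming policy[h][s] is a distribution over actions:

def estimate_J(policy, H, n_episodes=10_000):
    """Monte Carlo estimate of J(pi): average total reward over sampled episodes."""
    total = 0.0
    for _ in range(n_episodes):
        s = int(rng.choice(S, p=mu))
        for h in range(H):
            a = int(rng.choice(A, p=policy[h][s]))
            r, s = step(s, a)
            total += r
    return total / n_episodes

For instance, the uniformly random policy corresponds to policy = [np.full((S, A), 1 / A) for _ in range(H)].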

Value functions. For a Markov policy π, we can consider a more-refined notion: how much reward would we
get if we started in state s and time h and executed policy π until the end of the episode? This defines the value
function or the state-value function
" H−1 #
X
π
Vh (s) := E rh0 | sh = s, ah:H−1 ∼ π .
h0 =h

The state-action value function or just action-value function is similar. It describes how much reward we would get
if we started in state s at time h, took action a first, then followed π for the subsequent steps. This is defined as
" H−1 #
X
π
Qh (s, a) := E rh0 | sh = s, ah = a, ah+1:H−1 ∼ π .
h0 =h

It is worth making a couple of observations. First, we can write V^π_h(s) = E_{a_h ∼ π_h(s)}[Q^π_h(s, a_h)] to relate the two types of value functions. A related and more fundamental property is that these functions satisfy a recursive relationship, known as Bellman equations (or Bellman equations for policy evaluation):

V^π_h(s) = E[ r_h + V^π_{h+1}(s_{h+1}) | s_h = s, a_h ∼ π ]
Q^π_h(s, a) = E[ r_h + Q^π_{h+1}(s_{h+1}, a_{h+1}) | s_h = s, a_h = a, a_{h+1} ∼ π ]     (1)

Intuitively, the reward we expect to get by following π from (s, h) is the immediate reward r_h plus the reward we expect to get by following π from the next state and the next time step.
Next, if π is Markov, then J(π) = E_{s_0 ∼ µ}[V^π_0(s_0)], which relates the MDP objective to the value functions.
Finally, for non-Markov policies, we cannot really define these functions since the policy’s actions may depend on
information from the past that is not available. However, we will next see that the optimal policy is Markovian,
which justifies our effort in setting up these definitions.
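Because the Bellman equations (1) express V^π_h and Q^π_h in terms of quantities at step h + 1, a Markov policy can be evaluated exactly by a single backward pass. A sketch for the illustrative tabular MDP above (the function name is ours):

def policy_evaluation(policy, H):
    """Exact policy evaluation via the Bellman equations (1), backward in h.

    policy[h] is an S x A array of action probabilities; returns V[h, s] and Q[h, s, a]."""
    V = np.zeros((H + 1, S))                    # V^pi_H = 0: the episode has ended
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = R + P @ V[h + 1]                 # Q^pi_h(s,a) = r(s,a) + E_{s'~P(s,a)}[V^pi_{h+1}(s')]
        V[h] = (policy[h] * Q[h]).sum(axis=1)   # V^pi_h(s) = E_{a~pi_h(s)}[Q^pi_h(s,a)]
    return V, Q

With this, J(π) = mu @ V[0], matching the identity J(π) = E_{s_0 ∼ µ}[V^π_0(s_0)] above.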

3.2 Optimality and Bellman equations
A fundamental result in the theory of Markov decision processes is that the optimal policy is Markovian and the optimal value functions satisfy an elegant recursive definition. This is captured in the following theorem.

Theorem 1. Define the value functions (V*_0, . . . , V*_H) recursively as

V*_H ≡ 0,     ∀(s, h) : V*_h(s) = max_a { r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] }.     (2)

Then sup_π J(π) = E_{s_0 ∼ µ}[V*_0(s_0)], where the supremum is taken over all, possibly non-Markovian, policies. Additionally, define the policy π* := (π*_0, . . . , π*_{H−1}) as

∀(s, h) : π*_h(s) = argmax_a { r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] }.

Then π* achieves value V*_h(s) for all (s, h) pairs and hence π* is an optimal policy.
The two important aspects of this theorem are: (a) we need not consider non-Markovian policies since the
optimal policy is Markov, (b) there is a simple recursive formula for computing the optimal value function and the
optimal policy can be concisely described in terms of this value function.
An analogous result holds for the action-value functions if we define them as

Q*_h(s, a) = r(s, a) + E_{s' ∼ P(s,a)}[ max_{a'} Q*_{h+1}(s', a') ].     (3)

Then we can simply define the optimal policy as π*_h : s ↦ argmax_a Q*_h(s, a). Both (2) and (3) are referred to as Bellman optimality equations. Note that these equations show us how to address credit assignment in some sense. Indeed it could be that r(s, a) ≪ Q*_h(s, a), since taking action a in state s leads to some very favorable conditions later in the episode. So defining π*_h to maximize Q*_h leads to “non-myopic” behavior which is essential for decision making with a long horizon.
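For a quick illustration of this point, consider a toy example: suppose H = 2 and that in state s action a gives r(s, a) = 0 but leads deterministically to a state where a reward of 1 is available, while action b gives r(s, b) = 1/2 but leads to a state with no reward. Then Q*_0(s, a) = 1 > Q*_0(s, b) = 1/2, so the greedy-in-Q* policy takes a even though its immediate reward is smaller.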
Proof. The proof is based on the dynamic programming principle or an induction argument. The key fact is that π* is optimal pointwise, meaning from every state-time pair. This means that we can ignore how we arrive at a distribution over states (i.e., the past) when optimizing for the future.
As the base case, consider time step H. There are no further actions and no further rewards, so all policies accumulate 0 reward from here on. This means that V*_H ≡ 0 correctly describes the optimal reward achievable at the H-th time step.
For the induction, assume that V*_{h+1} satisfies the following two properties:

• The Markovian policy (π*_{h+1}, . . . , π*_{H−1}) achieves value V*_{h+1}(s) for each s ∈ S at time h + 1.

• For all possibly non-Markovian policies π̃ we have E[ Σ_{h'=h+1}^{H−1} r_{h'} | π̃ ] ≤ E[ V*_{h+1}(s_{h+1}) | π̃ ].

(Note that these two properties hold for V*_H trivially.) The second property allows us to study non-Markovian policies, since we can upper bound their future value via our function V*_{h+1}.
We establish these two properties for the value function V*_h defined via (2). With π*_h(s) = argmax_a { r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] }, the first property holds simply by the definition and the first inductive hypothesis.
For the second property consider some possibly non-Markovian policy π̃. We have

E[ Σ_{h'=h}^{H−1} r_{h'} | π̃ ] = E[ r_h + Σ_{h'=h+1}^{H−1} r_{h'} | π̃ ]
  ≤ E[ r_h + V*_{h+1}(s_{h+1}) | π̃ ]     (Inductive hypothesis)
  = E[ E[ r(s, a) + V*_{h+1}(s_{h+1}) | s_h = s, a_h = a ] | π̃ ]     (Iterated expectation)
  = E[ E[ r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] | s_h = s, a_h = a ] | π̃ ]     (Markov property of s_{h+1})
  ≤ E[ E[ max_{a∈A} { r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] } | s_h = s ] | π̃ ]
  = E[ V*_h(s_h) | π̃ ]     (Definition of V*_h)

Concluding the induction, we have value functions (V*_0, . . . , V*_H) and a policy (π*_0, . . . , π*_{H−1}) that achieves the value function at every (s, h) pair. Additionally, for any π̃:

J(π̃) = E[ Σ_{h=0}^{H−1} r_h | π̃ ] ≤ E[ V*_0(s_0) | π̃ ] = J(π*),

since s_0 ∼ µ does not depend on the choice of policy. This proves the theorem.

3.3 Planning algorithms for the episodic setting


Theorem 1 directly motivates one strategy for computing the optimal value function and policy. This is known as
value iteration. The algorithm is to simply apply (2) or (3) from time step H down to time step 0. Then we can
directly obtain the optimal policy from the value functions we compute.
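A sketch of this backward recursion for the illustrative tabular MDP from Section 2, applying (3) from h = H − 1 down to h = 0 and reading off the greedy policy (the function name is ours):

def value_iteration_finite(H):
    """Finite-horizon value iteration via the Bellman optimality equations (2)/(3)."""
    V = np.zeros((H + 1, S))                 # V*_H = 0
    Q = np.zeros((H, S, A))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q[h] = R + P @ V[h + 1]              # Q*_h(s,a) = r(s,a) + E_{s'~P(s,a)}[V*_{h+1}(s')]
        pi[h] = Q[h].argmax(axis=1)          # greedy (optimal) action at step h
        V[h] = Q[h].max(axis=1)              # V*_h(s) = max_a Q*_h(s,a)
    return V, Q, pi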
Another algorithm is called policy iteration. Starting with any non-stationary Markov policy π^(0), in the t-th iteration we update via

1. Compute Q^{π^(t−1)} via (1).

2. Update π^(t) to be the greedy policy with respect to Q^{π^(t−1)}, that is π^(t)_h(s) := argmax_a Q^{π^(t−1)}_h(s, a).

It can be easily seen that this algorithm converges in H iterations. Indeed, observe that even though π^(0) is arbitrary, we have that π^(1)_{H−1} = π*_{H−1}, since Q^π_{H−1} actually does not depend on π and is equal to Q*_{H−1}.
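A matching sketch of policy iteration, reusing the policy_evaluation routine sketched in Section 3.1 (again, the names are illustrative):

def policy_iteration(H):
    """Alternate exact evaluation via (1) with greedy improvement; converges within H iterations."""
    # Start from an arbitrary deterministic policy, encoded as one-hot action distributions.
    policy = [np.eye(A)[np.zeros(S, dtype=int)] for _ in range(H)]
    for _ in range(H):
        _, Q = policy_evaluation(policy, H)
        policy = [np.eye(A)[Q[h].argmax(axis=1)] for h in range(H)]
    return policy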

4 Infinite horizon discounted setting


Let us shift gears and consider a different formulation of reinforcement learning in an MDP. This is referred to as
the discounted setting, or infinite horizon discounted setting. Here, instead of a horizon H, we have a discount
factor γ ∈ (0, 1) that captures how much we prefer immediate rewards over future rewards. Here trajectories are
infinitely long, τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . .), but follow the same probabilistic model as before. For any policy π we can define the objective and value functions as

J(π) := E[ Σ_{h=0}^∞ γ^h r_h | π ],     V^π(s) = E[ Σ_{h=0}^∞ γ^h r_h | s_0 = s, π ],     Q^π(s, a) = E[ Σ_{h=0}^∞ γ^h r_h | s_0 = s, a_0 = a, π ].

(Note that we are now defining value functions for all policies, including non-Markovian ones.)
Roughly speaking, all of the results for the episodic setting carry over here, with appropriate modifications. For example, the optimal policy is Markov and, in fact, stationary (meaning it does not depend on the time step), and the optimal value function is the unique solution to the following fixed point equation:

V*(s) = max_a { r(s, a) + γ E_{s' ∼ P(s,a)}[V*(s')] }.     (4)

This should be viewed as the infinite horizon version of (2), but note that we are not time-indexing V* and it
appears on both sides of the equation. That said, an analogous result to Theorem 1 holds in this setting, although
it is a bit more subtle.

4.1 Planning algorithms


Both value iteration and policy iteration can be modified for the discounted setting. To describe these algorithms,
and for use in subsequent lectures, it is helpful to define the Bellman operator T as
T f : (s, a) ↦ r(s, a) + γ E_{s' ∼ P(s,a)}[ max_{a'} f(s', a') ].

Note that this operator takes Q-functions as input and produces a Q-function as output. Then the Bellman optimality equation (the analog of (3)) is simply that Q* = T Q*.
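For a tabular MDP the operator T is a one-line computation. The sketch below reuses the illustrative P and R from Section 2 and assumes a discount factor gamma (our own name):

gamma = 0.9    # illustrative discount factor

def bellman_operator(f):
    """(T f)(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)}[ max_{a'} f(s', a') ]."""
    return R + gamma * (P @ f.max(axis=1))

Iterating f ← bellman_operator(f) starting from f = 0 is exactly the value iteration procedure described next.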

Value iteration. Motivated by the fixed point relationship for Q*, value iteration simply iterates the operation f^(t+1) ← T f^(t), starting from an arbitrary initial Q-function f^(0). The key to the convergence of this algorithm is a certain contraction property for the Bellman operator.
Lemma 2 (Contraction). For any two Q-functions f, f', we have ‖T f − T f'‖_∞ ≤ γ‖f − f'‖_∞.
Proof. Considering any (s, a) pair we have

|(T f)(s, a) − (T f')(s, a)| = γ | E_{s' ∼ P(s,a)}[ max_{a'} f(s', a') − max_{a''} f'(s', a'') ] |
  ≤ γ E_{s' ∼ P(s,a)}[ max_{a'} |f(s', a') − f'(s', a')| ]
  ≤ γ ‖f − f'‖_∞.

The first inequality follows since, if max_{a'} f(s', a') ≥ max_{a''} f'(s', a''), then letting a' ∈ argmax_a f(s', a),

max_{a'} f(s', a') − max_{a''} f'(s', a'') ≤ f(s', a') − f'(s', a') ≤ max_a |f(s', a) − f'(s', a)|,

with a similar calculation for the other case.
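As a quick, purely illustrative numerical sanity check of Lemma 2, one can apply the bellman_operator sketch above to two random Q-functions and compare the sup-norm distances:

f1, f2 = rng.random((S, A)), rng.random((S, A))
lhs = np.abs(bellman_operator(f1) - bellman_operator(f2)).max()   # ||T f - T f'||_inf
rhs = gamma * np.abs(f1 - f2).max()                               # gamma * ||f - f'||_inf
assert lhs <= rhs + 1e-12                                         # contraction holds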

This will immediately give us geometric convergence to Q* by iterating the Bellman operator, but we need to translate error in Q* to policy performance. This is given in the next lemma.
The notation here is a bit confusing. We will use f to denote any function of the type S × A → R, which has the same type as the Q-functions. For a policy π, Q^π is the true action value function for π in the MDP. The confusing part is that if π_f : s ↦ argmax_a f(s, a) is the greedy policy with respect to function f, then Q^{π_f} will in general be distinct from f. This is why we are using f to denote general functions with this type.

Lemma 3 (Policy error). For any function f we have J(π*) ≤ J(π_f) + (2 / (1 − γ)) ‖f − Q*‖_∞.

Proof. Consider state s and let a = π_f(s) = argmax_{a'} f(s, a'). Then

V*(s) − V^{π_f}(s) = Q*(s, π*(s)) − Q*(s, a) + Q*(s, a) − Q^{π_f}(s, a)
  ≤ Q*(s, π*(s)) − f(s, π*(s)) + f(s, a) − Q*(s, a) + Q*(s, a) − Q^{π_f}(s, a)
  ≤ 2‖Q* − f‖_∞ + γ E_{s' ∼ P(s,a)}[V*(s') − V^{π_f}(s')]
  ≤ 2‖Q* − f‖_∞ + γ‖V* − V^{π_f}‖_∞.

Re-arranging this inequality actually proves a stronger statement, namely that ‖V* − V^{π_f}‖_∞ ≤ (2 / (1 − γ)) ‖Q* − f‖_∞, so V* and V^{π_f} are close at every state, which implies that J(π*) and J(π_f) are close.
Taking these two together, we immediately have an iteration complexity bound for value iteration.

Theorem 4. Set f^(0) = 0, run value iteration for T iterations, and define π̂ = π_{f^(T)}. Then

J(π*) − J(π̂) ≤ (2 γ^T ‖Q*‖_∞) / (1 − γ).
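Since rewards lie in [0, 1], we have ‖Q*‖_∞ ≤ 1/(1 − γ), so the bound reads J(π*) − J(π̂) ≤ 2γ^T/(1 − γ)^2. As a rough calibration: to guarantee J(π*) − J(π̂) ≤ ε it suffices that γ^T ≤ ε(1 − γ)^2/2, i.e. T ≥ log(2/((1 − γ)^2 ε))/log(1/γ), and since log(1/γ) ≥ 1 − γ, taking T = ⌈(1/(1 − γ)) · log(2/((1 − γ)^2 ε))⌉ iterations is enough.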
