06 MDP

Markov Decision Processes

Deep Reinforcement Learning

2013 Atari (DQN) [DeepMind]
Pong, Enduro, Beamrider, Q*bert

2015 AlphaGo [DeepMind]
AlphaGo: Silver et al., Nature 2015
AlphaGo Zero: Silver et al., Nature 2017
AlphaZero: Silver et al., 2017
See also Tian et al., 2016; Maddison et al., 2014; Clark et al., 2015

2016 3D locomotion (TRPO+GAE) [Berkeley]
Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016

2016 Real Robot Manipulation (GPS) [Berkeley]
Levine*, Finn*, Darrell, Abbeel, JMLR 2016

2019 Rubik’s Cube (PPO+DR) [OpenAI]
Non-Deterministic Search
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path

▪ Noisy movement: actions do not always go as planned (see the sketch after this list)
▪ 80% of the time, the action North takes the agent North (if there is no wall there)
▪ 10% of the time, North takes the agent West; 10% East
▪ If there is a wall in the direction the agent would have been taken, the agent stays put

▪ The agent receives rewards
▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)

▪ Goal: maximize sum of rewards
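Below is a minimal sketch of the noisy movement model just described, in Python (illustrative only; the grid representation and helper names are my own assumptions, not the lecture's code):

```python
import random

# 80% intended direction, 10% rotated left, 10% rotated right;
# a move into a wall or off the grid leaves the agent where it is.
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
LEFT_OF = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def noisy_step(state, action, walls, rows, cols):
    """Sample the next (row, col) cell given the intended action."""
    r = random.random()
    direction = action if r < 0.8 else (LEFT_OF[action] if r < 0.9 else RIGHT_OF[action])
    dr, dc = MOVES[direction]
    nxt = (state[0] + dr, state[1] + dc)
    if nxt in walls or not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
        return state                      # blocked: the agent stays put
    return nxt
```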


Grid World Actions
Deterministic Grid World Stochastic Grid World
Markov Decision Processes
o An MDP is defined by:
o A set of states s ∈ S
o A set of actions a ∈ A
o A transition function T(s, a, s’)
o Probability that a from s leads to s’, i.e., P(s’| s, a)
o Also called the model or the dynamics
o A reward function R(s, a, s’)
o Sometimes just R(s) or R(s’)
o A start state
o Maybe a terminal state
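As a concrete, illustrative way to hold these pieces in code (mine, not the slides'), an MDP can be packaged like this:

```python
from typing import Dict, List, NamedTuple, Tuple

class MDP(NamedTuple):
    """A finite MDP as defined above; field names are illustrative."""
    states: List[str]
    actions: List[str]
    # T[(s, a)] is a list of (next_state, probability) pairs, i.e. T(s, a, s').
    T: Dict[Tuple[str, str], List[Tuple[str, float]]]
    # R[(s, a, s_next)] is the reward for that transition, i.e. R(s, a, s').
    R: Dict[Tuple[str, str, str], float]
    start: str
    terminals: List[str]
```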
What is Markov about MDPs?
o “Markov” generally means that given the present state, the future and the past are independent

o For Markov decision processes, “Markov” means action outcomes depend only on the current state

Andrey Markov
(1856-1922)

o This is just like search, where the successor function could only depend on the current state (not the history)
Policies
o In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal

o For MDPs, we want an optimal policy π*: S → A
o A policy π gives an action for each state
o An optimal policy is one that maximizes expected utility if followed
o An explicit policy defines a reflex agent

Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
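As a tiny illustration (mine, not the lecture's), a policy is just a table from states to actions:

```python
# A (not necessarily optimal) policy: one action chosen per state.
policy = {"cool": "fast", "warm": "slow"}

def act(state: str) -> str:
    """An explicit policy defines a reflex agent: look up the action for the state."""
    return policy[state]
```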
Optimal Policies
Example: Racing
o A robot car wants to travel far, quickly
o Three states: Cool, Warm, Overheated
o Two actions: Slow, Fast
o Going faster gets double reward

Transitions (from the diagram):
o Cool, Slow → Cool (1.0), reward +1
o Cool, Fast → Cool (0.5) or Warm (0.5), reward +2
o Warm, Slow → Cool (0.5) or Warm (0.5), reward +1
o Warm, Fast → Overheated (1.0), reward -10
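Read off the diagram above, the same model can be written as explicit tables (an illustrative encoding, not code from the slides):

```python
# Transition model: T[(s, a)] -> list of (next_state, probability).
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}
# Reward model: R[(s, a, s_next)] -> reward for that transition.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}
```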
Racing Search Tree
MDP Search Trees
o Each MDP state s projects an expectimax-like search tree
o s is a state
o (s, a) is a q-state
o (s, a, s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
Utilities of Sequences
Utilities of Sequences
o What preferences should an agent have over reward sequences?

o More or less? [1, 2, 2] or [2, 3, 4]

o Now or later? [0, 0, 1] or [1, 0, 0]


Discounting
o It’s reasonable to maximize the sum of rewards
o It’s also reasonable to prefer rewards now to rewards later
o One solution: values of rewards decay exponentially

A reward is worth 1 now, worth γ at the next step, and worth γ² two steps from now.


Discounting
o How to discount?
o Each time we descend a level, we
multiply in the discount once

o Why discount?
o Reward now is better than later
o Can also think of it as a (1 - γ) chance of ending the process at every step
o Also helps our algorithms converge

o Example: discount γ = 0.5
o U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
o U([3,2,1]) = 1*3 + 0.5*2 + 0.25*1 = 4.25
o So U([1,2,3]) < U([3,2,1])
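A tiny helper that mirrors this computation (a sketch, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

assert discounted_return([1, 2, 3], 0.5) == 2.75   # the example above
assert discounted_return([3, 2, 1], 0.5) == 4.25
```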
Compute Probability
Rewards and Utility

Discounted Rewards and Utility
Stochastic World
o In a stochastic environment, the utility of a state under a policy π is the expected discounted sum of rewards, Uπ(s) = E[ R0 + γ·R1 + γ²·R2 + … ],
o where E is the expectation with respect to the probability distribution over the possible paths that follow the policy.
Stochastic World
o With stochasticity, the utility of a state under a given policy can also be written as its expected immediate reward plus the discounted utility of its successor state, where the successor is drawn from the transition probability distribution.
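In symbols, with T, R, and γ as defined earlier, the statement above reads:

Uπ(s) = Σs’ T(s, π(s), s’) · [ R(s, π(s), s’) + γ·Uπ(s’) ]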

Stationary Preferences
o Theorem: if we assume stationary preferences, i.e., [a1, a2, …] is preferred to [b1, b2, …] exactly when [r, a1, a2, …] is preferred to [r, b1, b2, …]

o Then: there are only two ways to define utilities
o Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …
o Discounted utility: U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …
Quiz: Discounting
o Given: a row of states a, b, c, d, e; exiting at a yields reward 10 (see figure)

o Actions: East, West, and Exit (only available in exit states a, e)

o Transitions: deterministic

o Quiz 1: For γ = 1, what is the optimal policy? <- <- <-

o Quiz 2: For γ = 0.1, what is the optimal policy? <- <- ->

o Quiz 3: For which γ are West and East equally good when in state d?
Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite
rewards?
▪ Solutions:
▪ Finite horizon: (similar to depth-limited search)
▪ Terminate episodes after a fixed T steps (e.g. life)
▪ Gives nonstationary policies (π depends on time left)

▪ Discounting: use 0 < γ < 1 (see the note after this list)

▪ Smaller γ means a smaller “horizon”, i.e., a shorter-term focus

▪ Absorbing state: guarantee that for every policy, a terminal state will
eventually be reached (like “overheated” for racing)
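Why discounting keeps utilities finite: if every step reward satisfies |R| ≤ Rmax and 0 < γ < 1, then the total discounted utility is bounded by Rmax·(1 + γ + γ² + …) = Rmax / (1 - γ), so even an infinite episode has a finite utility.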
Recap: Defining MDPs
o Markov decision processes:
o Set of states S
o Start state s0
o Set of actions A
o Transitions P(s’|s,a) (or T(s,a,s’))
o Rewards R(s,a,s’) (and discount γ)

o MDP quantities so far:


o Policy = Choice of action for each state
o Utility = sum of (discounted) rewards
Solving MDPs
Recall: Racing MDP
o A robot car wants to travel far, quickly
o Three states: Cool, Warm, Overheated
o Two actions: Slow, Fast
o Going faster gets double reward
o Transitions as in the diagram above: Cool/Slow → Cool (1.0), +1; Cool/Fast → Cool or Warm (0.5 each), +2; Warm/Slow → Cool or Warm (0.5 each), +1; Warm/Fast → Overheated (1.0), -10
Racing Search Tree
o We’re doing way too much work with expectimax!

o Problem: States are repeated
o Idea: Only compute needed quantities once

o Problem: Tree goes on forever
o Idea: Do a depth-limited computation, but with increasing depths until change is small
o Note: deep parts of the tree eventually don’t matter if γ < 1
Optimal Quantities

▪ The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

▪ The value (utility) of a q-state (s, a):
Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

▪ The optimal policy:
π*(s) = optimal action from state s

(Tree notation: s is a state; (s, a) is a q-state; (s, a, s’) is a transition)
Gridworld V* Values

Noise = 0.2
Discount = 0.9
Living reward = 0
Gridworld Q* Values

Noise = 0.2
Discount = 0.9
Living reward = 0
Values of States (Bellman Eqns)
o Recursive definition of value: the value of a state is computed by a one-step lookahead from s over actions a (giving q-states (s, a)) and transitions to successors s’, as written below.
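In this notation, these are the standard Bellman optimality equations:

V*(s) = max_a Q*(s, a)
Q*(s, a) = Σs’ T(s, a, s’) · [ R(s, a, s’) + γ·V*(s’) ]
so V*(s) = max_a Σs’ T(s, a, s’) · [ R(s, a, s’) + γ·V*(s’) ]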
Policy Evaluation

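Policy evaluation computes the utility of a fixed policy π; a standard iterative scheme, consistent with the description on the policy-iteration slides below, repeatedly applies

Vk+1(s) = Σs’ T(s, π(s), s’) · [ R(s, π(s), s’) + γ·Vk(s’) ]   (values of the fixed policy π)

i.e., the Bellman update with the action fixed by π rather than maximized over.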
Policy Improvement

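Policy improvement holds the utilities fixed and extracts the greedy one-step-lookahead policy; a standard way to write it is

πnew(s) = argmax_a Σs’ T(s, a, s’) · [ R(s, a, s’) + γ·V(s’) ]

where V is the utility of the current policy.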
k = 0 through k = 12, and k = 100
(Figures: gridworld values after k steps of the update; Noise = 0.2, Discount = 0.9, Living reward = 0)
Policy Iteration
o In policy iteration, we alternate policy evaluation and policy improvement. In policy evaluation, we keep the policy constant and update the utility based on that policy.

o In policy improvement, we keep the utility constant and update the policy based on that utility.
Policy Iteration
o The whole process of policy iteration is then:

o We start with an arbitrary policy π₁, get its utility v₁ by policy evaluation, get a new policy π₂ from v₁ by policy improvement, get the utility v₂ of π₂ by policy evaluation, … until we converge on the optimal policy π*.
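A compact Python sketch of this loop (illustrative only: the function names are mine, and the T/R tables are assumed to be dicts shaped like the racing example encoded earlier):

```python
def policy_evaluation(policy, T, R, states, gamma, n_sweeps=100):
    """Approximate the utility of a fixed policy by repeated Bellman backups."""
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        V = {s: (sum(p * (R[(s, policy[s], s2)] + gamma * V[s2])
                     for s2, p in T[(s, policy[s])])
                 if s in policy else 0.0)          # terminal states stay at 0
             for s in states}
    return V

def policy_improvement(V, T, R, states, actions, gamma):
    """Greedy one-step-lookahead policy with respect to the utilities V."""
    def q(s, a):
        return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
    return {s: max((a for a in actions if (s, a) in T), key=lambda a: q(s, a))
            for s in states if any((s, a) in T for a in actions)}

def policy_iteration(T, R, states, actions, gamma):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: next(a for a in actions if (s, a) in T)   # arbitrary start
              for s in states if any((s, a) in T for a in actions)}
    while True:
        V = policy_evaluation(policy, T, R, states, gamma)
        new_policy = policy_improvement(V, T, R, states, actions, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy
```

For example, with the racing tables sketched earlier, policy_iteration(T, R, ["cool", "warm", "overheated"], ["slow", "fast"], gamma=0.9) returns a policy for the non-terminal states together with their estimated utilities.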
Time-Limited Values
o Key idea: time-limited values

o Define Vk(s) to be the optimal value of s if the game ends in k more time steps
o Equivalently, it’s what a depth-k expectimax would give from s
Computing Time-Limited Values
Value Iteration
Value Iteration
o Start with V0(s) = 0: no time steps left means an expected reward sum of zero
o Given a vector of Vk(s) values, do one ply of expectimax from each state:

Vk+1(s) = max_a Σs’ T(s, a, s’) · [ R(s, a, s’) + γ·Vk(s’) ]

o Repeat until convergence, which yields V*

o Complexity of each iteration: O(S²A)

o Theorem: will converge to unique optimal values


o Basic idea: approximations get refined towards optimal values
o Policy may converge long before values do
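A minimal value-iteration sketch matching the update above (illustrative, not the lecture's code; it assumes T/R dict tables shaped like the racing example encoded earlier):

```python
def value_iteration(T, R, states, actions, gamma, tol=1e-6):
    """Repeat the one-ply Bellman update until the values stop changing."""
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    while True:
        new_V = {}
        for s in states:
            qs = [sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                  for a in actions if (s, a) in T]
            new_V[s] = max(qs) if qs else 0.0         # terminal states stay at 0
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
```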
Convergence*
o How do we know the Vk vectors are going to converge? (assuming 0 < γ < 1)

o Proof Sketch:
o For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results on nearly identical search trees
o The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
o That last layer is at best all RMAX and at worst all RMIN
o But everything on that layer is discounted by γ^k
o So Vk and Vk+1 differ by at most γ^k·max|R|
o Summing these geometrically shrinking differences, Vk and any later Vk+m differ by at most γ^k·max|R| / (1 - γ), which goes to 0 as k grows, so the values converge
