06 MDP

Markov Decision Processes

Deep Reinforcement Learning

2013 Atari (DQN) [DeepMind]
Pong, Enduro, Beamrider, Q*bert

2015 AlphaGo [DeepMind]
AlphaGo: Silver et al., Nature 2015
AlphaGo Zero: Silver et al., Nature 2017
AlphaZero: Silver et al., 2017
See also Tian et al., 2016; Maddison et al., 2014; Clark et al., 2015

2016 3D locomotion (TRPO+GAE) [Berkeley]
Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016

2016 Real Robot Manipulation (GPS) [Berkeley]
Levine*, Finn*, Darrell, Abbeel, JMLR 2016

2019 Rubik’s Cube (PPO+DR) [OpenAI]
Non-Deterministic Search
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path

▪ Noisy movement: actions do not always go as planned (see the sketch after this list)
▪ 80% of the time, the action North takes the agent North (if there is no wall there)
▪ 10% of the time, North takes the agent West; 10% East
▪ If there is a wall in the direction the agent would have been taken, the agent stays put

▪ The agent receives rewards
▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)

▪ Goal: maximize sum of rewards
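Below is a minimal sketch of the noisy movement model just described, in Python (illustrative only; the grid representation and helper names are my own assumptions, not the lecture's code):

```python
import random

# 80% intended direction, 10% rotated left, 10% rotated right;
# a move into a wall or off the grid leaves the agent where it is.
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
LEFT_OF = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def noisy_step(state, action, walls, rows, cols):
    """Sample the next (row, col) cell given the intended action."""
    r = random.random()
    direction = action if r < 0.8 else (LEFT_OF[action] if r < 0.9 else RIGHT_OF[action])
    dr, dc = MOVES[direction]
    nxt = (state[0] + dr, state[1] + dc)
    if nxt in walls or not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
        return state                      # blocked: the agent stays put
    return nxt
```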


Grid World Actions
Deterministic Grid World Stochastic Grid World
Markov Decision Processes
o An MDP is defined by:
o A set of states s ∈ S
o A set of actions a ∈ A
o A transition function T(s, a, s’)
o Probability that a from s leads to s’, i.e., P(s’| s, a)
o Also called the model or the dynamics
o A reward function R(s, a, s’)
o Sometimes just R(s) or R(s’)
o A start state
o Maybe a terminal state
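As a concrete, illustrative way to hold these pieces in code (mine, not the slides'), an MDP can be packaged like this:

```python
from typing import Dict, List, NamedTuple, Tuple

class MDP(NamedTuple):
    """A finite MDP as defined above; field names are illustrative."""
    states: List[str]
    actions: List[str]
    # T[(s, a)] is a list of (next_state, probability) pairs, i.e. T(s, a, s').
    T: Dict[Tuple[str, str], List[Tuple[str, float]]]
    # R[(s, a, s_next)] is the reward for that transition, i.e. R(s, a, s').
    R: Dict[Tuple[str, str, str], float]
    start: str
    terminals: List[str]
```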
What is Markov about MDPs?
o “Markov” generally means that given the present state, the future and the past are independent

o For Markov decision processes, “Markov” means action outcomes depend only on the current state

Andrey Markov
(1856-1922)

o This is just like search, where the successor function could only depend on the current state (not the history)
Policies
o In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal

o For MDPs, we want an optimal policy π*: S → A
o A policy π gives an action for each state
o An optimal policy is one that maximizes expected utility if followed
o An explicit policy defines a reflex agent

Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
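As a tiny illustration (mine, not the lecture's), a policy is just a table from states to actions:

```python
# A (not necessarily optimal) policy: one action chosen per state.
policy = {"cool": "fast", "warm": "slow"}

def act(state: str) -> str:
    """An explicit policy defines a reflex agent: look up the action for the state."""
    return policy[state]
```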
Optimal Policies
Example: Racing
o A robot car wants to travel far, quickly
o Three states: Cool, Warm, Overheated
o Two actions: Slow, Fast
o Going faster gets double reward

Transitions (from the diagram):
o Cool, Slow → Cool (1.0), reward +1
o Cool, Fast → Cool (0.5) or Warm (0.5), reward +2
o Warm, Slow → Cool (0.5) or Warm (0.5), reward +1
o Warm, Fast → Overheated (1.0), reward -10
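Read off the diagram above, the same model can be written as explicit tables (an illustrative encoding, not code from the slides):

```python
# Transition model: T[(s, a)] -> list of (next_state, probability).
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("overheated", 1.0)],
}
# Reward model: R[(s, a, s_next)] -> reward for that transition.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}
```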
Racing Search Tree
MDP Search Trees
o Each MDP state s projects an expectimax-like search tree
o s is a state
o (s, a) is a q-state
o (s, a, s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
Utilities of Sequences
Utilities of Sequences
o What preferences should an agent have over reward sequences?

o More or less? [1, 2, 2] or [2, 3, 4]

o Now or later? [0, 0, 1] or [1, 0, 0]


Discounting
o It’s reasonable to maximize the sum of rewards
o It’s also reasonable to prefer rewards now to rewards later
o One solution: values of rewards decay exponentially

A reward is worth 1 now, worth γ at the next step, and worth γ² two steps from now.


Discounting
o How to discount?
o Each time we descend a level, we
multiply in the discount once

o Why discount?
o Reward now is better than later
o Can also think of it as a (1 - γ) chance of ending the process at every step
o Also helps our algorithms converge

o Example: discount γ = 0.5
o U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
o U([3,2,1]) = 1*3 + 0.5*2 + 0.25*1 = 4.25
o So U([1,2,3]) < U([3,2,1])
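A tiny helper that mirrors this computation (a sketch, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

assert discounted_return([1, 2, 3], 0.5) == 2.75   # the example above
assert discounted_return([3, 2, 1], 0.5) == 4.25
```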
Compute Probability
Rewards and Utility

Discounted Rewards and Utility
Stochastic World
o In a stochastic environment, the utility of a state under a policy π is the expected discounted sum of rewards, Uπ(s) = E[ R0 + γ·R1 + γ²·R2 + … ],
o where E is the expectation with respect to the probability distribution over the possible paths that follow the policy.
Stochastic World
o With stochasticity, the utility of a state under a given policy can also be written as its expected immediate reward plus the discounted utility of its successor state, where the successor is drawn from the transition probability distribution.
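In symbols, with T, R, and γ as defined earlier, the statement above reads:

Uπ(s) = Σs’ T(s, π(s), s’) · [ R(s, π(s), s’) + γ·Uπ(s’) ]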

Stationary Preferences
o Theorem: if we assume stationary preferences, i.e., [a1, a2, …] is preferred to [b1, b2, …] exactly when [r, a1, a2, …] is preferred to [r, b1, b2, …]

o Then: there are only two ways to define utilities
o Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …
o Discounted utility: U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …
Quiz: Discounting
o Given: a row of states a, b, c, d, e; exiting at a yields reward 10 (see figure)

o Actions: East, West, and Exit (only available in exit states a, e)

o Transitions: deterministic

o Quiz 1: For γ = 1, what is the optimal policy? <- <- <-

o Quiz 2: For γ = 0.1, what is the optimal policy? <- <- ->

o Quiz 3: For which γ are West and East equally good when in state d?
Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite
rewards?
▪ Solutions:
▪ Finite horizon: (similar to depth-limited search)
▪ Terminate episodes after a fixed T steps (e.g. life)
▪ Gives nonstationary policies (π depends on time left)

▪ Discounting: use 0 < γ < 1 (see the note after this list)

▪ Smaller γ means a smaller “horizon”, i.e., a shorter-term focus

▪ Absorbing state: guarantee that for every policy, a terminal state will
eventually be reached (like “overheated” for racing)
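Why discounting keeps utilities finite: if every step reward satisfies |R| ≤ Rmax and 0 < γ < 1, then the total discounted utility is bounded by Rmax·(1 + γ + γ² + …) = Rmax / (1 - γ), so even an infinite episode has a finite utility.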
Recap: Defining MDPs
o Markov decision processes:
o Set of states S
o Start state s0
o Set of actions A
o Transitions P(s’|s,a) (or T(s,a,s’))
o Rewards R(s,a,s’) (and discount γ)

o MDP quantities so far:


o Policy = Choice of action for each state
o Utility = sum of (discounted) rewards
Solving MDPs
Recall: Racing MDP
o A robot car wants to travel far, quickly
o Three states: Cool, Warm, Overheated
o Two actions: Slow, Fast
o Going faster gets double reward
o Transitions as in the diagram above: Cool/Slow → Cool (1.0), +1; Cool/Fast → Cool or Warm (0.5 each), +2; Warm/Slow → Cool or Warm (0.5 each), +1; Warm/Fast → Overheated (1.0), -10
Racing Search Tree
o We’re doing way too much work with expectimax!

o Problem: States are repeated
o Idea: Only compute needed quantities once

o Problem: Tree goes on forever
o Idea: Do a depth-limited computation, but with increasing depths until change is small
o Note: deep parts of the tree eventually don’t matter if γ < 1
Optimal Quantities

▪ The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

▪ The value (utility) of a q-state (s, a):
Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

▪ The optimal policy:
π*(s) = optimal action from state s

(Tree notation: s is a state; (s, a) is a q-state; (s, a, s’) is a transition)
Gridworld V* Values

Noise = 0.2
Discount = 0.9
Living reward = 0
Gridworld Q* Values

Noise = 0.2
Discount = 0.9
Living reward = 0
Values of States (Bellman Eqns)
o Recursive definition of value: the value of a state is computed by a one-step lookahead from s over actions a (giving q-states (s, a)) and transitions to successors s’, as written below.
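In this notation, these are the standard Bellman optimality equations:

V*(s) = max_a Q*(s, a)
Q*(s, a) = Σs’ T(s, a, s’) · [ R(s, a, s’) + γ·V*(s’) ]
so V*(s) = max_a Σs’ T(s, a, s’) · [ R(s, a, s’) + γ·V*(s’) ]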
Policy Evaluation

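Policy evaluation computes the utility of a fixed policy π; a standard iterative scheme, consistent with the description on the policy-iteration slides below, repeatedly applies

Vk+1(s) = Σs’ T(s, π(s), s’) · [ R(s, π(s), s’) + γ·Vk(s’) ]   (values of the fixed policy π)

i.e., the Bellman update with the action fixed by π rather than maximized over.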
Policy Improvement

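Policy improvement holds the utilities fixed and extracts the greedy one-step-lookahead policy; a standard way to write it is

πnew(s) = argmax_a Σs’ T(s, a, s’) · [ R(s, a, s’) + γ·V(s’) ]

where V is the utility of the current policy.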
k = 0 through k = 12, and k = 100
(Figures: gridworld values after k steps of the update; Noise = 0.2, Discount = 0.9, Living reward = 0)
Policy Iteration
o In policy iteration, we alternate policy evaluation and policy improvement. In policy evaluation, we keep the policy constant and update the utility based on that policy.

o In policy improvement, we keep the utility constant and update the policy based on that utility.
Policy Iteration
o The whole process of policy iteration is then:

o We start with an arbitrary policy π₁, get its utility v₁ by policy evaluation, get a new policy π₂ from v₁ by policy improvement, get the utility v₂ of π₂ by policy evaluation, … until we converge on the optimal policy π*.
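A compact Python sketch of this loop (illustrative only: the function names are mine, and the T/R tables are assumed to be dicts shaped like the racing example encoded earlier):

```python
def policy_evaluation(policy, T, R, states, gamma, n_sweeps=100):
    """Approximate the utility of a fixed policy by repeated Bellman backups."""
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        V = {s: (sum(p * (R[(s, policy[s], s2)] + gamma * V[s2])
                     for s2, p in T[(s, policy[s])])
                 if s in policy else 0.0)          # terminal states stay at 0
             for s in states}
    return V

def policy_improvement(V, T, R, states, actions, gamma):
    """Greedy one-step-lookahead policy with respect to the utilities V."""
    def q(s, a):
        return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
    return {s: max((a for a in actions if (s, a) in T), key=lambda a: q(s, a))
            for s in states if any((s, a) in T for a in actions)}

def policy_iteration(T, R, states, actions, gamma):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: next(a for a in actions if (s, a) in T)   # arbitrary start
              for s in states if any((s, a) in T for a in actions)}
    while True:
        V = policy_evaluation(policy, T, R, states, gamma)
        new_policy = policy_improvement(V, T, R, states, actions, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy
```

For example, with the racing tables sketched earlier, policy_iteration(T, R, ["cool", "warm", "overheated"], ["slow", "fast"], gamma=0.9) returns a policy for the non-terminal states together with their estimated utilities.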
Time-Limited Values
o Key idea: time-limited values

o Define Vk(s) to be the optimal value of s if the game ends in k more time steps
o Equivalently, it’s what a depth-k expectimax would give from s
Computing Time-Limited Values
Value Iteration
Value Iteration
o Start with V0(s) = 0: no time steps left means an expected reward sum of zero
o Given a vector of Vk(s) values, do one ply of expectimax from each state:

Vk+1(s) = max_a Σs’ T(s, a, s’) · [ R(s, a, s’) + γ·Vk(s’) ]

o Repeat until convergence, which yields V*

o Complexity of each iteration: O(S²A)

o Theorem: will converge to unique optimal values


o Basic idea: approximations get refined towards optimal values
o Policy may converge long before values do
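A minimal value-iteration sketch matching the update above (illustrative, not the lecture's code; it assumes T/R dict tables shaped like the racing example encoded earlier):

```python
def value_iteration(T, R, states, actions, gamma, tol=1e-6):
    """Repeat the one-ply Bellman update until the values stop changing."""
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    while True:
        new_V = {}
        for s in states:
            qs = [sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                  for a in actions if (s, a) in T]
            new_V[s] = max(qs) if qs else 0.0         # terminal states stay at 0
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
```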
Convergence*
o How do we know the Vk vectors are going to converge? (assuming 0 < γ < 1)

o Proof Sketch:
o For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results on nearly identical search trees
o The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
o That last layer is at best all RMAX and at worst all RMIN
o But everything on that layer is discounted by γ^k
o So Vk and Vk+1 differ by at most γ^k·max|R|
o Summing these geometrically shrinking differences, Vk and any later Vk+m differ by at most γ^k·max|R| / (1 - γ), which goes to 0 as k grows, so the values converge
