Lecture 7: MDPs I

This document covers Markov decision processes (MDPs) and how to solve them with techniques such as expectimax search. It introduces Grid World, an example MDP in which an agent moves through a grid under noisy (stochastic) movement. MDPs are non-deterministic search problems, and the goal is to find an optimal policy that maximizes expected utility (the sum of rewards) over time; discounting future rewards expresses a preference for rewards received sooner rather than later.


Artificial Intelligence: Non-Deterministic Search

Markov Decision Processes

Example: Grid World

§ A maze-like problem
  § The agent lives in a grid
  § Walls block the agent’s path

§ Noisy movement: actions do not always go as planned
  § 80% of the time, the action North takes the agent North (if there is no wall there)
  § 10% of the time, North takes the agent West; 10% East
  § If there is a wall in the direction the agent would have been taken, the agent stays put

§ The agent receives rewards each time step
  § Small “living” reward each step (can be negative)
  § Big rewards come at the end (good or bad)

§ Goal: maximize the sum of rewards

Grid World Actions

Deterministic Grid World vs. Stochastic Grid World (a sketch of the noisy transition model follows below)
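The noisy transition model described above can be written down directly. The following is a minimal sketch, assuming an 80/10/10 slip model; the grid bounds, wall set, and function names are illustrative assumptions, not the lecture's code.

```python
# Sketch of the stochastic Grid World transition model: 80% intended direction,
# 10% slip to each perpendicular direction, blocked moves leave the agent in place.

NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)

# For each action, the two perpendicular "slip" directions.
SLIP = {
    NORTH: (EAST, WEST),
    SOUTH: (EAST, WEST),
    EAST: (NORTH, SOUTH),
    WEST: (NORTH, SOUTH),
}

def transition(state, action, walls, width, height):
    """Return a list of (next_state, probability) pairs, i.e. P(s' | s, a)."""
    def move(direction):
        x, y = state
        dx, dy = direction
        nxt = (x + dx, y + dy)
        # If a wall (or the grid edge) blocks the move, the agent stays put.
        if nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
            return state
        return nxt

    side1, side2 = SLIP[action]
    outcomes = {}
    for direction, prob in [(action, 0.8), (side1, 0.1), (side2, 0.1)]:
        nxt = move(direction)
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob  # merge outcomes that coincide
    return list(outcomes.items())
```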

Markov Decision Processes

§ An MDP is defined by (a concrete dictionary encoding is sketched after this slide):
  § A set of states s ∈ S
  § A set of actions a ∈ A
  § A transition function T(s, a, s’)
    § Probability that a from s leads to s’, i.e., P(s’ | s, a)
    § Also called the model or the dynamics
  § A reward function R(s, a, s’)
    § Sometimes just R(s) or R(s’)
  § A start state
  § Maybe a terminal state

§ MDPs are non-deterministic search problems
  § One way to solve them is with expectimax search
  § We’ll have a new tool soon

What is Markov about MDPs?

§ “Markov” generally means that given the present state, the future and the past are independent

§ For Markov decision processes, “Markov” means action outcomes depend only on the current state

§ This is just like search, where the successor function could only depend on the current state (not the history)

Andrey Markov (1856–1922)

[Demo – gridworld manual intro (L8D1)]
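To make the definition concrete, one simple way to represent the pieces (S, A, T, R, start state) in code is with plain dictionaries. This convention is an assumption of these notes, not the lecture's code, and it is reused in the later sketches.

```python
# One possible plain-dictionary encoding of an MDP's pieces (illustrative only).
# A tiny two-state example is filled in to show the shape of each structure.

states = ["A", "B"]                       # the set of states S
actions = {"A": ["stay", "go"], "B": []}  # A(s); an empty list marks a terminal state
start_state = "A"

# Transition function T(s, a, s') stored as (s, a) -> list of (s', probability) pairs
T = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "go"):   [("A", 0.2), ("B", 0.8)],   # noisy outcome: P(s' | s, a)
}

# Reward function R(s, a, s') stored as (s, a, s') -> reward
R = {
    ("A", "stay", "A"): -0.1,   # small "living" reward (negative here)
    ("A", "go", "A"):   -0.1,
    ("A", "go", "B"):   +1.0,   # big reward at the end
}
```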

Policies

§ In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal

§ For MDPs, we want an optimal policy π*: S → A
  § A policy π gives an action for each state
  § An optimal policy is one that maximizes expected utility if followed
  § An explicit policy defines a reflex agent
  [Figure: optimal policy when R(s, a, s’) = -0.03 for all non-terminals s]

§ Expectimax didn’t compute entire policies
  § It computed the action for a single state only

Video of Demo Gridworld Manual Intro

Example: Racing

§ A robot car wants to travel far, quickly
§ Three states: Cool, Warm, Overheated
§ Two actions: Slow, Fast
§ Going faster gets double reward

[Transition diagram: from Cool, Slow stays in Cool with probability 1.0 (reward +1), while Fast goes to Cool or Warm with probability 0.5 each (reward +2); from Warm, Slow goes to Cool or Warm with probability 0.5 each (reward +1), while Fast goes to Overheated with probability 1.0 (reward -10); Overheated is terminal]
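Using the dictionary convention introduced earlier, the racing MDP could be encoded as follows. The probabilities and rewards follow the diagram as reconstructed above, so treat the exact numbers as an assumption about the original figure.

```python
# Racing MDP in the plain-dictionary encoding used throughout these notes.

states = ["Cool", "Warm", "Overheated"]
actions = {
    "Cool": ["Slow", "Fast"],
    "Warm": ["Slow", "Fast"],
    "Overheated": [],            # terminal: no actions available
}
start_state = "Cool"

T = {
    ("Cool", "Slow"): [("Cool", 1.0)],
    ("Cool", "Fast"): [("Cool", 0.5), ("Warm", 0.5)],
    ("Warm", "Slow"): [("Cool", 0.5), ("Warm", 0.5)],
    ("Warm", "Fast"): [("Overheated", 1.0)],
}

R = {
    ("Cool", "Slow", "Cool"): 1.0,
    ("Cool", "Fast", "Cool"): 2.0,    # going faster gets double reward
    ("Cool", "Fast", "Warm"): 2.0,
    ("Warm", "Slow", "Cool"): 1.0,
    ("Warm", "Slow", "Warm"): 1.0,
    ("Warm", "Fast", "Overheated"): -10.0,
}
```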

Racing Search Tree

MDP Search Trees

§ Each MDP state projects an expectimax-like search tree

[Tree diagram: s is a state; (s, a) is a q-state; (s, a, s’) is called a transition, with T(s, a, s’) = P(s’ | s, a) and reward R(s, a, s’)]

Utilities of Sequences
§ What preferences should an agent have over reward sequences?

§ More or less? [1, 2, 2] or [2, 3, 4]

§ Now or later? [0, 0, 1] or [1, 0, 0]

Discounting

§ It’s reasonable to maximize the sum of rewards
§ It’s also reasonable to prefer rewards now to rewards later

§ One solution: values of rewards decay exponentially
  [Figure labels: worth 1 now, γ one step from now, γ² two steps from now]

§ How to discount?
  § Each time we descend a level, we multiply in the discount once

§ Why discount?
  § Sooner rewards probably do have higher utility than later rewards
  § Also helps our algorithms converge

§ Example: discount of 0.5
  § U([1, 2, 3]) = 1*1 + 0.5*2 + 0.25*3
  § U([1, 2, 3]) < U([3, 2, 1])
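With a discount of 0.5, the arithmetic works out as follows; the second line completes the comparison the slide only states as an inequality.

```latex
U([1,2,3]) = 1 \cdot 1 + 0.5 \cdot 2 + 0.25 \cdot 3 = 2.75
U([3,2,1]) = 1 \cdot 3 + 0.5 \cdot 2 + 0.25 \cdot 1 = 4.25
\Rightarrow U([1,2,3]) < U([3,2,1])
```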

Stationary Preferences

§ Theorem: if we assume stationary preferences:

§ Then: there are only two ways to define utilities (the standard forms are reconstructed after the quiz below)
  § Additive utility
  § Discounted utility

Quiz: Discounting

§ Given:
  [Grid figure omitted in extraction]

§ Actions: East, West, and Exit (only available in exit states a, e)

§ Transitions: deterministic

§ Quiz 1: For γ = 1, what is the optimal policy?

§ Quiz 2: For γ = 0.1, what is the optimal policy?

§ Quiz 3: For which γ are West and East equally good when in state d?
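The formulas on the Stationary Preferences slide did not survive extraction; in standard notation (a reconstruction, so treat it as such) they are:

```latex
% Stationary preferences: prepending the same reward to two sequences
% does not change which sequence is preferred.
[a_1, a_2, \ldots] \succ [b_1, b_2, \ldots]
  \;\Longleftrightarrow\;
[r, a_1, a_2, \ldots] \succ [r, b_1, b_2, \ldots]

% Additive utility:
U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots

% Discounted utility:
U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots
```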

Infinite Utilities?!

§ Problem: What if the game lasts forever? Do we get infinite rewards?

§ Solutions:
  § Finite horizon: (similar to depth-limited search)
    § Terminate episodes after a fixed T steps (e.g. life)
    § Gives nonstationary policies (π depends on the time left)
  § Discounting: use 0 < γ < 1 (the bound this gives on total utility is reconstructed after this slide)
    § Smaller γ means a smaller “horizon” – shorter-term focus
  § Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)

Recap: Defining MDPs

§ Markov decision processes:
  § Set of states S
  § Start state s0
  § Set of actions A
  § Transitions P(s’|s,a) (or T(s,a,s’))
  § Rewards R(s,a,s’) (and discount γ)

§ MDP quantities so far:
  § Policy = choice of action for each state
  § Utility = sum of (discounted) rewards
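The bound that makes discounting work (the slide's own formula was lost in extraction; this is the standard geometric-series form) caps the utility of any infinite reward sequence:

```latex
U([r_0, r_1, r_2, \ldots]) = \sum_{t=0}^{\infty} \gamma^{t} r_t
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  \;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 < \gamma < 1
```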

Solving MDPs

Optimal Quantities

§ The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally

§ The value (utility) of a q-state (s, a):
  Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

§ The optimal policy:
  π*(s) = optimal action from state s

[Tree diagram: s is a state; (s, a) is a q-state; (s, a, s’) is a transition]

[Demo – gridworld values (L8D4)]

Snapshot of Demo – Gridworld V Values (Noise = 0, Discount = 1, Living reward = 0)

Snapshot of Demo – Gridworld Q Values (Noise = 0, Discount = 1, Living reward = 0)

Snapshot of Demo – Gridworld V Values (Noise = 0.2, Discount = 1, Living reward = 0)

Snapshot of Demo – Gridworld Q Values (Noise = 0.2, Discount = 1, Living reward = 0)

Snapshot of Demo – Gridworld V Values (Noise = 0.2, Discount = 0.9, Living reward = 0)

Snapshot of Demo – Gridworld Q Values (Noise = 0.2, Discount = 0.9, Living reward = 0)

Values of States

§ Fundamental operation: compute the (expectimax) value of a state
  § Expected utility under optimal action
  § Average sum of (discounted) rewards
  § This is just what expectimax computed!

§ Recursive definition of value (the standard equations are reconstructed below):
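The recursive definition referenced above, written out in the standard notation for V* and Q* (reconstructed here, since the slide's equations did not survive extraction):

```latex
V^{*}(s)    = \max_{a} Q^{*}(s, a)

Q^{*}(s, a) = \sum_{s'} T(s, a, s')\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]

V^{*}(s)    = \max_{a} \sum_{s'} T(s, a, s')\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]
```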

Racing Search Tree

§ We’re doing too much work with expectimax!

§ Problem: States are repeated
  § Idea: Only compute needed quantities once

§ Problem: Tree goes on forever
  § Idea: Do a depth-limited computation, but with increasing depths until change is small
  § Note: deep parts of the tree eventually don’t matter if γ < 1

Time-Limited Values

§ Key idea: time-limited values

§ Define Vk(s) to be the optimal value of s if the game ends in k more time steps
  § Equivalently, it’s what a depth-k expectimax would give from s

[Demo – time-limited values (L8D6)]

k = 0 (Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 1 through k = 12 (Noise = 0.2, Discount = 0.9, Living reward = 0)

k = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0)

Computing Time-Limited Values

Value Iteration

§ Start with V0(s) = 0: no time steps left means an expected reward sum of zero

§ Given the vector of Vk(s) values, do one ply of expectimax from each state to obtain Vk+1(s) (the update equation and a small implementation sketch follow this slide)

§ Repeat until convergence

§ Complexity of each iteration: O(S²A)

§ Theorem: will converge to unique optimal values
  § Basic idea: approximations get refined towards optimal values
  § Policy may converge long before values do
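The one-ply update referenced above, in standard form (the slide's equation was lost in extraction):

```latex
V_{k+1}(s) \;=\; \max_{a} \sum_{s'} T(s, a, s')\,\bigl[\, R(s, a, s') + \gamma\, V_{k}(s') \,\bigr]
```

A minimal Python sketch of the loop, reusing the dictionary encoding assumed earlier in these notes (not the lecture's code):

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Iterate V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * V_k(s')]."""
    V = {s: 0.0 for s in states}                      # V_0(s) = 0 for every state
    while True:
        V_next = {}
        for s in states:
            if not actions.get(s):                    # terminal state: value stays 0
                V_next[s] = 0.0
            else:
                # One ply of expectimax from s: max over actions of expected value
                V_next[s] = max(
                    sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                    for a in actions[s]
                )
        # Repeat until convergence: stop when the largest change is below tol
        if max(abs(V_next[s] - V[s]) for s in states) < tol:
            return V_next
        V = V_next
```

Called on the racing MDP encoded earlier with some discount 0 < γ < 1, each pass of the loop computes the next time-limited value vector Vk+1 from Vk.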

Example: Value Iteration

Assume no discount! Over the states (Cool, Warm, Overheated), successive value vectors are:

  V2:  3.5   2.5   0
  V1:  2     1     0
  V0:  0     0     0

(A worked computation of these backups follows the convergence sketch below.)

Convergence*

§ How do we know the Vk vectors are going to converge?

§ Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values

§ Case 2: If the discount is less than 1
  § Sketch: For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  § The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
  § That last layer is at best all R_max
  § It is at worst R_min
  § But everything is discounted by γ^k that far out
  § So Vk and Vk+1 are at most γ^k max|R| different
  § So as k increases, the values converge
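For the value grid above, and assuming the racing-MDP transitions as reconstructed earlier (an assumption about the dropped diagram), two undiscounted backups reproduce the numbers, with V(Overheated) = 0 throughout since it is terminal:

```latex
V_1(\text{Cool}) = \max\{\, 1.0\,(1+0),\; 0.5\,(2+0) + 0.5\,(2+0) \,\} = \max\{1, 2\} = 2
V_1(\text{Warm}) = \max\{\, 0.5\,(1+0) + 0.5\,(1+0),\; 1.0\,(-10+0) \,\} = \max\{1, -10\} = 1
V_2(\text{Cool}) = \max\{\, 1.0\,(1+2),\; 0.5\,(2+2) + 0.5\,(2+1) \,\} = \max\{3, 3.5\} = 3.5
V_2(\text{Warm}) = \max\{\, 0.5\,(1+2) + 0.5\,(1+1),\; 1.0\,(-10+0) \,\} = \max\{2.5, -10\} = 2.5
```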

Next Time: Policy-Based Methods
