(24F-COSE361) 5. Markov Decision Process
Human Rationality
Preferences
▪ Counterexample
▪ An agent with intransitive preferences can be exploited repeatedly, losing value on every cycle of trades
▪ e.g., Rock Paper Scissors
Maximum Expected Utility (MEU)
▪ Rational preferences
▪ Imply behaviors describable as maximization of expected utility
▪ i.e., values assigned by U preserve preferences of both certain and uncertain outcomes
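Stated compactly (standard utility-theory notation; U, A, B, and the lottery [p_1, S_1; …; p_n, S_n] are generic symbols, not from the slide): rational preferences guarantee a utility function U with

U(A) \ge U(B) \iff A \succeq B
U([p_1, S_1; \ldots; p_n, S_n]) = \sum_i p_i \, U(S_i)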
[Figure: decision network — action node Umbrella (leave / take), chance node Weather (sun / rain), observed node Forecast (bad / good), utility node U]
Decision Network
▪ Maximum Expected Utility (MEU)
▪ Choose the action that maximizes the expected utility given the evidence
▪ Utility table U(A, W) and weather prior P(W) (a computational sketch follows):
A      W     U(A, W)        W     P(W)
leave  sun   100            sun   0.7
leave  rain  0              rain  0.3
take   sun   20
take   rain  70
▪ EU(Umbrella = leave) = 0.7 × 100 + 0.3 × 0 = 70
▪ EU(Umbrella = take) = 0.7 × 20 + 0.3 × 70 = 35
▪ Optimal decision = leave
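A minimal sketch of this computation in Python (variable names and layout are mine; the numbers come from the tables above):

# Expected utility of each umbrella action under the weather prior,
# then the MEU decision.
P_W = {"sun": 0.7, "rain": 0.3}                      # P(W)
U = {("leave", "sun"): 100, ("leave", "rain"): 0,    # U(A, W)
     ("take", "sun"): 20,   ("take", "rain"): 70}

def expected_utility(action, p_w=P_W):
    # EU(a) = sum_w P(w) * U(a, w)
    return sum(p * U[(action, w)] for w, p in p_w.items())

for a in ("leave", "take"):
    print(a, expected_utility(a))                    # leave 70.0, take 35.0
print("MEU decision:", max(("leave", "take"), key=expected_utility))   # leave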
Decisions as Outcome Trees
[Figure: two outcome trees — choosing Umbrella with no evidence {}, where chance nodes use Weather | {}; and choosing Umbrella after observing Forecast = bad, where chance nodes use Weather | {bad}]
▪ Almost exactly like expectimax, but… what’s changed?
▪ The probabilities at each chance node are now computed conditioned on the evidence
Value of Perfect Information (VPI)
▪ MEU(e) = Maximum Expected Utility, given the known evidence E=e
▪ We assume that the evidence e is known (i.e., nodes we know)
▪ Calculating MEU requires taking a maximum over several expectations (i.e., one EU per action)
▪ VPI(E'|e) = Expected gain in utility from learning the value of a new E', given the evidence e known so far
▪ E': the random variable(s) whose value we want to learn (i.e., new evidence to reveal)
▪ e: the random variable(s) whose value we already know (i.e., the evidence already known)
▪ Calculating VPI requires taking an expectation over several MEUs
▪ i.e., one MEU per possible outcome of E', because we don’t know the value of E'
Value of Perfect Information (VPI)
▪ Assume we have evidence E = e. Value (MEU) if we act now:
MEU(e) = max_a Σ_s P(s | e) U(s, a)
▪ Assume we then observe E' = e'. Value (MEU) if we act after seeing it:
MEU(e, e') = max_a Σ_s P(s | e, e') U(s, a)
▪ But E' is a random variable whose value we do not yet know, so we take an expectation over its possible outcomes:
MEU(e, E') = Σ_{e'} P(e' | e) MEU(e, e')
▪ Value of perfect information:
▪ How much MEU goes up by revealing E' first and then acting, over acting now:
VPI(E' | e) = MEU(e, E') − MEU(e)
VPI Example
▪ Utility table U(A, W) and weather prior P(W), as before:
A      W     U(A, W)        W     P(W)
leave  sun   100            sun   0.7
leave  rain  0              rain  0.3
take   sun   20
take   rain  70
▪ Posterior weather distributions given the Forecast:
W     P(W | F = bad)        W     P(W | F = good)
sun   0.34                  sun   0.95
rain  0.66                  rain  0.05
▪ MEU with no evidence (optimal decision = leave): 0.7 × 100 + 0.3 × 0 = 70
▪ MEU if Forecast is bad (optimal decision = take): 0.34 × 20 + 0.66 × 70 = 53
▪ MEU if Forecast is good (optimal decision = leave): 0.95 × 100 + 0.05 × 0 = 95
▪ A computational sketch of VPI(Forecast) follows.
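A sketch of the VPI computation in Python. Note that P(F) is not given on the slide; below it is recovered by requiring consistency with the prior, P(sun) = Σ_f P(sun | f) P(f), which is an assumption about how the tables were generated:

# VPI of the Forecast node, using the tables above.
P_W = {"sun": 0.7, "rain": 0.3}
P_W_given_F = {"bad":  {"sun": 0.34, "rain": 0.66},
               "good": {"sun": 0.95, "rain": 0.05}}
U = {("leave", "sun"): 100, ("leave", "rain"): 0,
     ("take", "sun"): 20,   ("take", "rain"): 70}

def meu(p_w):
    # MEU = max_a sum_w P(w) * U(a, w) for a given weather distribution
    return max(sum(p * U[(a, w)] for w, p in p_w.items())
               for a in ("leave", "take"))

# Recover P(F=bad) from P(sun) = P(sun|bad) P(bad) + P(sun|good) (1 - P(bad)) -- assumed
p_bad = (P_W["sun"] - P_W_given_F["good"]["sun"]) / (
        P_W_given_F["bad"]["sun"] - P_W_given_F["good"]["sun"])
P_F = {"bad": p_bad, "good": 1 - p_bad}            # roughly 0.41 / 0.59

meu_now = meu(P_W)                                 # 70.0
meu_later = sum(P_F[f] * meu(P_W_given_F[f]) for f in P_F)
print("VPI(Forecast) =", meu_later - meu_now)      # about 7.8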
[Figure: a time-line of actions A0, A1, A2 and rewards R0, R1, R2]
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path
▪ Noisy movement
▪ Actions do not always go as planned
▪ 80% of the time, North → North
▪ 10% of the time, North → West; 10% of the time, North → East
▪ If there is a wall in the direction of movement, the agent stays put (see the transition sketch below)
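A sketch of this noisy transition model (the coordinate convention, helper names, and example wall layout are assumptions, not from the slide):

# Noisy grid-world movement: intended direction with prob. 0.8,
# each perpendicular direction with prob. 0.1; bumping a wall stays put.
DIRS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def transition(state, action, is_blocked):
    # Return {next_state: probability} given a wall-test predicate.
    probs = {}
    for direction, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        dx, dy = DIRS[direction]
        target = (state[0] + dx, state[1] + dy)
        nxt = state if is_blocked(target) else target   # stay put at walls
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# Hypothetical 4x3 layout with a single interior wall at (1, 1)
blocked = lambda s: s == (1, 1) or not (0 <= s[0] < 4 and 0 <= s[1] < 3)
print(transition((0, 0), "N", blocked))   # {(0, 1): 0.8, (0, 0): 0.1, (1, 0): 0.1}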
Markov Decision Process
▪ An MDP is defined by:
▪ Set of states s ∈ S
▪ Set of actions a ∈ A
▪ Transition function T(s, a, s')
▪ Probability that a at s leads to s', i.e., P(s' | s, a)
▪ Also called the model or the dynamics
▪ Reward function R(s, a, s')
▪ Sometimes just R(s) or R(s')
▪ Start state
▪ Terminal state (optional)
▪ Action outcomes depend only on the current state (i.e., the Markov property)
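Written out, the Markov property states that the next-state distribution depends only on the current state and action:

P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1}, A_{t-1}, \ldots, S_0) = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t) = T(s_t, a_t, s')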
Example: Optimal Policies in Grid World?
▪ The optimal policy depends on the living reward: compare R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, R(s) = -2.0
▪ (A maximum time step exists)
Discounting
▪ How to discount?
▪ Each time we descend a level, we multiply in the discount once
▪ Why discount?
▪ Sooner rewards probably do have higher utility than later rewards
▪ Also helps our algorithms converge
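Concretely, with a discount factor \gamma between 0 and 1, a reward received t steps in the future is weighted by \gamma^t, so the utility of a reward sequence is

U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t \ge 0} \gamma^t r_t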
▪ Quiz 3: For which discount γ are West and East equally good when in state d?
Summary: Defining MDPs
▪ Markov decision processes:
▪ Set of states S
▪ Start state s0
▪ Set of actions A
▪ Transitions P(s' | s, a) (or T(s, a, s'))
▪ Rewards R(s, a, s') (and discount γ)
[Figure: MDP search tree with nodes s, a, (s, a), (s, a, s'), s']