(24F-COSE361) 5. Markov Decision Process

The document discusses Markov Decision Processes (MDPs) within the context of artificial intelligence, focusing on rational decision-making and the structure of decision networks. It outlines the components of MDPs, including states, actions, transition functions, and reward functions, and emphasizes the importance of maximizing expected utility. Additionally, it explores the concept of policies in MDPs and how they can be used to determine optimal actions in uncertain environments.

COSE361: Artificial Intelligence

Markov Decision Processes (1)

[Adapted from the original slides from CS188 Intro to AI at UC Berkeley]


Outline
▪ Rational Decision
▪ Decision Network
▪ Markov Decision Process
RECAP: What should we build?
Machines with artificial intelligence should …

▪ Think like people?
▪ Cognitive science / neuroscience → natural intelligence?
▪ Think rationally?
▪ Laws of the thought process → unknowns and uncertainties?
▪ Act like people?
▪ Turing test (Alan Turing) → mistakes with a human touch?
▪ Act rationally?
▪ An agent (i.e., something that acts) → to achieve the best outcome!
RECAP: In a nutshell
▪ Artificial Intelligence has focused on the study and construction
of rational agents that do the right thing.
▪ The right thing is defined by the objective that we provide to the agent.
Recap: Rational Decisions
▪ We’ll use the term rational in a very specific, technical way:
▪ Rational: maximally achieving pre-defined goals
▪ Goals are expressed in terms of the utility of outcomes
▪ Utilities describe an agent’s preferences for the outcomes (states of the world)
▪ Utilities are quantified as functions of outcomes → real numbers
Recap: Rational Decisions
▪ We’ll use the term rational in a very specific, technical way:
▪ Rational: maximally achieving pre-defined goals
▪ Goals are expressed in terms of the utility of outcomes
▪ World is uncertain, so we may want to use expected utility
▪ Being rational means acting to maximize your expected utility
Utilities

▪ Utility = a function F from outcomes (states of the world) to real numbers (rewards or incentives)


▪ Utilities summarize the agent’s preferences on outcomes
▪ In a game, utility may be simple scores (e.g., 300 pts or -100 pts)
▪ Theorem: any “rational” preferences can be summarized as a utility function

▪ We preset utilities and let behaviors emerge


▪ Why don’t we prescribe behaviors or let agents pick utilities?
▪ Safety, adaptability, and practicality
Utilities with Uncertain Outcomes
[Figure: Getting ice cream — the agent chooses Get Single or Get Double; the double scoop may fall (Oops) or not (Whew!)]
Preferences

▪ Notation
▪ Outcomes: A, B
▪ Certain outcome: A
▪ Uncertain outcome (a lottery): [p, A; (1-p), B], i.e., A with probability p and B with probability 1-p
▪ Preference: A ≻ B
▪ Indifference: A ∼ B

▪ An agent must have preferences among outcomes

▪ e.g., preferences over ice-cream prizes/lotteries
▪ Certain: A = Single, B = Double
▪ Uncertain: the lottery [p, Single; (1-p), Double]
Rational Preferences
▪ Axiom of rationality
▪ We need some constraints on preferences to call them rational

▪ Counterexample
▪ An agent with intransitive preferences can be exploited into repeated losses
▪ e.g., Rock Paper Scissors
Maximum Expected Utility (MEU)
▪ Rational preferences
▪ Imply behaviors describable as maximization of expected utility

▪ Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]


▪ Given any preferences satisfying these axioms, there exists a real-valued function U such that:
▪ U(A) ≥ U(B) ⇔ A ⪰ B, and U([p1, S1; … ; pn, Sn]) = Σ_i p_i U(S_i)

▪ i.e., values assigned by U preserve preferences over both certain and uncertain outcomes

▪ Principle of Maximum Expected Utility (MEU)


▪ A rational agent should choose the action that maximizes its expected utility, given its knowledge
▪ Note: An agent can be rational (consistent with MEU) without being aware of utilities and probabilities
▪ e.g., A reflex agent can be rational - a lookup table for perfect tic-tac-toe, a reflex vacuum cleaner
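To make the MEU principle concrete, here is a minimal sketch (not from the slides) of choosing the action with the highest expected utility; the action names, outcome distributions, and utility values below are hypothetical placeholders.

```python
def expected_utility(action, outcome_probs, utility):
    """EU(a) = sum over outcomes s of P(s | a) * U(s)."""
    return sum(p * utility[s] for s, p in outcome_probs[action].items())

def meu_action(actions, outcome_probs, utility):
    """Principle of MEU: pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, outcome_probs, utility))

# Hypothetical example: two actions over two possible outcomes.
utility = {"good": 10.0, "bad": -5.0}
outcome_probs = {
    "risky": {"good": 0.6, "bad": 0.4},   # EU = 0.6*10 - 0.4*5 = 4.0
    "safe":  {"good": 0.5, "bad": 0.5},   # EU = 0.5*10 - 0.5*5 = 2.5
}
print(meu_action(["risky", "safe"], outcome_probs, utility))  # -> risky
```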
Human Utilities?
▪ Let’s do a poll! - famous example of Allais (1953)
Utilities of Sequences
▪ What preferences should an agent have over reward sequences?
▪ e.g., A sequence of rewards collected up to time t: [r0, r1, r2, … rt]

▪ More or less? [1, 2, 3] or [2, 3, 4]

▪ Now or later? [0, 0, 1] or [1, 0, 0]

▪ What about this? [1, 2, 3] or [0, 0, 8]


Discounting Utilities
▪ It’s reasonable to maximize the sum of rewards
▪ It’s also reasonable to prefer rewards now to rewards later

▪ One solution: values of rewards decay exponentially

A reward r is worth r now, γ·r one step later, and γ²·r two steps later


Stationary Preferences
▪ Stationarity: if [a, r1, r2, …] ≻ [a, r1', r2', …], then [r1, r2, …] ≻ [r1', r2', …]
▪ To preserve stationary preferences, we adopt additive discounted utility:

U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …, where γ ∈ (0, 1] is the discount factor
▪ γ = 1: additive utility
▪ γ < 1: discounted utility
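As a quick illustration of the additive discounted utility formula above, the following sketch (assuming the reward sequence is given as a Python list) sums rewards weighted by powers of γ.

```python
def discounted_utility(rewards, gamma=1.0):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], gamma=1.0))  # additive utility: 1 + 2 + 3 = 6
print(discounted_utility([1, 2, 3], gamma=0.5))  # discounted: 1 + 1.0 + 0.75 = 2.75
```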
Outline
▪ Rational Decision
▪ Decision Network
▪ Markov Decision Process
Decision Network

[Figure: decision network — an action node Umbrella (leave / take), chance nodes Weather (sun / rain) and Forecast (bad / good), and a utility node U depending on Umbrella and Weather]
Decision Network
▪ Maximum Expected Utility (MEU)
▪ Choose the action that maximizes the expected utility given the evidence

▪ Decision networks can be used to work with MEU


▪ Bayes nets with nodes for utility and actions
▪ Calculate the expected utility for each action

▪ New node types:
▪ Chance node
▪ Circle, just like random variables in Bayes nets
▪ Action node
▪ Rectangle, acts as observed evidence (but selected by the agent)
▪ Utility node
▪ Diamond, depends on action and chance nodes
Decision Network
▪ Procedure of action selection with MEU
▪ Instantiate all evidence
▪ Set the action node(s) each possible way
▪ Calculate posteriors for all parents of the utility node, given the evidence
▪ Calculate the expected utility for each action
▪ Choose the action maximizing the expected utility
Decision Network Example (Simple)
▪ EU(action) = Expected Utility of taking the action
▪ MEU(evidence = ø) = Maximum Expected Utility, given no evidence (i.e., the empty evidence set)

Utility U(A, W):
A      W     U(A,W)
leave  sun   100
leave  rain  0
take   sun   20
take   rain  70

Prior over weather:
W     P(W)
sun   0.7
rain  0.3

Optimal decision = leave
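A small sketch of the expected-utility computation behind this example, using the U(A, W) and P(W) tables above (the code layout and names are illustrative, not the course's code).

```python
# Utility table U(A, W) and weather prior P(W) from the slide above.
U = {("leave", "sun"): 100, ("leave", "rain"): 0,
     ("take", "sun"): 20, ("take", "rain"): 70}
P_W = {"sun": 0.7, "rain": 0.3}

def eu(action, p_weather):
    """EU(a) = sum over w of P(w) * U(a, w)."""
    return sum(p * U[(action, w)] for w, p in p_weather.items())

# EU(leave) = 0.7*100 + 0.3*0 = 70
# EU(take)  = 0.7*20  + 0.3*70 = 35
best = max(("leave", "take"), key=lambda a: eu(a, P_W))
print(best)  # -> leave, so MEU(no evidence) = 70
```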
Decisions as Outcome Trees
[Figure: outcome tree with no evidence {} — the root chooses Umbrella (take or leave); each choice leads to a chance node over Weather | {}, ending in leaves U(t,s), U(t,r), U(l,s), U(l,r)]
▪ Almost exactly like expectimax, but… what’s changed?


▪ Computation of probabilities at each uncertain node
Decision Network Example (with Evidence)
▪ EU(action | evidence) = Expected Utility of taking the action, given the evidence
▪ MEU(evidence) = Maximum Expected Utility, given the evidence

Utility U(A, W):
A      W     U(A,W)
leave  sun   100
leave  rain  0
take   sun   20
take   rain  70

Posterior given the forecast:
W     P(W|F=bad)
sun   0.34
rain  0.66

Optimal decision = take if Forecast = bad
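The same kind of sketch as before, but with expectations taken under the posterior P(W | F=bad) from the table above (again, the code is illustrative rather than the course's).

```python
# Same utility table, but expectations now use P(W | Forecast = bad).
U = {("leave", "sun"): 100, ("leave", "rain"): 0,
     ("take", "sun"): 20, ("take", "rain"): 70}
P_W_bad = {"sun": 0.34, "rain": 0.66}

def eu(action, p_weather):
    return sum(p * U[(action, w)] for w, p in p_weather.items())

# EU(leave | F=bad) = 0.34*100 + 0.66*0  = 34
# EU(take  | F=bad) = 0.34*20  + 0.66*70 = 53
best = max(("leave", "take"), key=lambda a: eu(a, P_W_bad))
print(best)  # -> take, so MEU(F=bad) = 53
```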
Decisions as Outcome Trees
[Figure: outcome tree given evidence {b} (Forecast = bad) — the root chooses Umbrella; each choice leads to a chance node over Weather | {b}, ending in leaves U(t,s), U(t,r), U(l,s), U(l,r)]
▪ Almost exactly like expectimax, but… what’s changed?
▪ Computation of probabilities at each uncertain node
Value of Perfect Information (VPI)
▪ MEU(e) = Maximum Expected Utility, given the known evidence E=e
▪ We assume that the evidence e is known (i.e., nodes we know)
▪ Calculating MEU requires taking a maximum over several expectations (i.e., one EU per action)

▪ VPI(E'|e) = Expected gain in utility from knowing a new variable E', given the evidence e known so far
▪ E': the random variable(s) whose value we want to learn (i.e., new evidence to reveal)
▪ e: the random variable(s) whose value we already know (i.e., the known evidence)
▪ Calculating VPI requires taking an expectation over several MEUs
▪ i.e., one MEU per possible outcome of E', because we don’t know the value of E'
Value of Perfect Information (VPI)
▪ Assume we have evidence E = e. Value of MEU if we act now:
  MEU(e) = max_a Σ_s P(s | e) U(s, a)

▪ Assume we then also observe E' = e'. Value of MEU if we act after seeing it:
  MEU(e, e') = max_a Σ_s P(s | e, e') U(s, a)

▪ But E' is a random variable whose value is unknown
▪ We don't know what e' will be

▪ Expected value of MEU if E' is revealed and then we act:
  MEU(e, E') = Σ_e' P(e' | e) MEU(e, e')

▪ Value of perfect information:
▪ How much MEU goes up by revealing E' first and then acting, over acting now:
  VPI(E' | e) = MEU(e, E') − MEU(e)
VPI Example
Utility U(A, W):
A      W     U(A,W)
leave  sun   100
leave  rain  0
take   sun   20
take   rain  70

Weather distributions:
W     P(W)   P(W|F=bad)   P(W|F=good)
sun   0.7    0.34         0.95
rain  0.3    0.66         0.05

Forecast prior:
F      P(F)
good   0.59
bad    0.41

▪ MEU with no evidence
▪ MEU if Forecast is bad (optimal decision = take)
▪ MEU if Forecast is good (optimal decision = leave)
▪ MEU with evidence (Forecast revealed)
▪ Value of perfect information
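Putting the pieces together, here is a worked sketch of the VPI computation for this example, plugging in the tables above (the function and variable names are mine, not the course's).

```python
# Utility table and distributions from the tables above.
U = {("leave", "sun"): 100, ("leave", "rain"): 0,
     ("take", "sun"): 20, ("take", "rain"): 70}

def meu(p_weather):
    """Maximum over actions of the expected utility under the given weather distribution."""
    return max(sum(p * U[(a, w)] for w, p in p_weather.items())
               for a in ("leave", "take"))

meu_now  = meu({"sun": 0.70, "rain": 0.30})   # 70 (leave)
meu_bad  = meu({"sun": 0.34, "rain": 0.66})   # 53 (take)
meu_good = meu({"sun": 0.95, "rain": 0.05})   # 95 (leave)

p_forecast = {"good": 0.59, "bad": 0.41}
meu_with_forecast = p_forecast["good"] * meu_good + p_forecast["bad"] * meu_bad  # ~ 77.8

vpi_forecast = meu_with_forecast - meu_now    # ~ 7.8: expected gain from seeing the forecast
print(round(vpi_forecast, 2))
```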


Outline
▪ Rational Decision
▪ Decision Network
▪ Markov Decision Process
▪ Defining MDPs
▪ Solving MDPs (Value-based)
▪ Solving MDPs (Policy-based)

Search + Probabilities + Time + Actions


Markov Decision Process
[Figure: three graphical models compared —
Markov model: a chain of states X0 → X1 → X2 → X3;
Decision network: an action node A and a chance node X feeding a utility node U;
Markov decision process: a chain of states X0 → X1 → X2 → X3 in which each transition is driven by an action (A0, A1, A2) and produces a reward (R0, R1, R2)]
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path

▪ Noisy movement
▪ Actions do not always go as planned
▪ 80% of the time, North → North
▪ 10% of the time, North → West or North → East
▪ If there is a wall in the direction of movement, the agent stays put

▪ The agent receives rewards each time step


▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)

▪ Goal is to maximize sum of rewards
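A minimal sketch of the noisy movement model described above (the coordinate convention, direction encoding, and wall set are illustrative assumptions, not the course's code): 80% of the time the agent moves in the intended direction, 10% in each of the two perpendicular directions, and it stays put when a wall blocks the move.

```python
NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)
PERPENDICULAR = {NORTH: (WEST, EAST), SOUTH: (WEST, EAST),
                 EAST: (NORTH, SOUTH), WEST: (NORTH, SOUTH)}

def transition(state, action, walls):
    """Return {next_state: probability} for the noisy grid-world move."""
    result = {}
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]
    for direction, prob in outcomes:
        nxt = (state[0] + direction[0], state[1] + direction[1])
        if nxt in walls:              # blocked by a wall: the agent stays put
            nxt = state
        result[nxt] = result.get(nxt, 0.0) + prob
    return result

print(transition((1, 1), NORTH, walls={(1, 2)}))
# -> {(1, 1): 0.8, (0, 1): 0.1, (2, 1): 0.1}
```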


Example: Grid World
Deterministic Grid World vs. Stochastic Grid World
[Figure: in the deterministic grid world, the action Up always moves the agent up; in the stochastic grid world, Up succeeds with probability 0.8 and slips Left or Right with probability 0.1 each]
Markov Decision Process
▪ An MDP is defined by:
▪ A set of states s ∈ S
▪ A set of actions a ∈ A
▪ A transition function T(s, a, s')
▪ Probability that a from s leads to s', i.e., P(s' | s, a)
▪ Also called the model or the dynamics
▪ A reward function R(s, a, s')
▪ Sometimes just R(s) or R(s')
▪ A start state
▪ A terminal state (optional)

▪ Action outcomes depend only on the current state (i.e., the Markov property)

▪ MDPs are non-deterministic search problems
▪ One way to solve them is with expectimax search
▪ We'll have a new tool soon
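One possible way to write the (S, A, T, R) definition above down in code is as a small container of the MDP ingredients. This is only a sketch of an interface (the field names are mine); a concrete problem such as the grid world would supply the actual functions.

```python
from typing import Any, Callable, Dict, Iterable, NamedTuple

State = Any
Action = Any

class MDP(NamedTuple):
    """Container for the ingredients that define an MDP."""
    states: Iterable[State]                                     # set of states S
    actions: Callable[[State], Iterable[Action]]                # available actions A(s)
    transition: Callable[[State, Action], Dict[State, float]]   # T(s, a) -> {s': P(s'|s,a)}
    reward: Callable[[State, Action, State], float]             # R(s, a, s')
    start: State                                                # start state
    terminals: frozenset = frozenset()                          # optional terminal states
```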
Policies
▪ Deterministic single-agent search problems
▪ Find an optimal plan, or sequence of actions, from start to a goal

▪ For MDPs, which are non-deterministic search problems, we want an optimal policy π*: S → A
▪ A policy π gives an action for each state
▪ An optimal policy is one that maximizes expected utility (MEU) if followed
▪ An explicit policy defines a reflex agent (see the sketch after this slide)

▪ Note: expectimax didn't compute entire policies
▪ It computes the action for a single state only

[Figure: example policy on the grid world]
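Since an explicit policy defines a reflex agent, it can be represented as nothing more than a lookup from states to actions; the grid coordinates and action names below are illustrative placeholders.

```python
# A policy as a simple mapping from states to actions (illustrative values).
policy = {
    (0, 0): "North",
    (0, 1): "North",
    (0, 2): "East",
    (1, 2): "East",
    (2, 2): "East",
}

def act(state):
    """A policy-following agent is a reflex agent: look up the action for the state."""
    return policy[state]

print(act((0, 1)))  # -> North
```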


Example: Policies in Grid World?

▪ 80% of the time, North → North


▪ 10% of the time, North → West
▪ 10% of the time, North → East
Example: Optimal Policies in Grid World

▪ 80% of the time, North → North


▪ 10% of the time, North → West
▪ 10% of the time, North → East

[Figure: optimal policies for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0; a maximum time step exists]
Example: Optimal Policies in Grid World?

▪ 80% of the time, North → North


▪ 10% of the time, North → West
▪ 10% of the time, North → East

The living reward R(s) > 0?


MDP Search Trees
▪ Each MDP state projects an expectimax-like search tree
▪ s: a state
▪ (s, a): a q-state, reached after committing to action a in state s
▪ (s, a, s'): a transition, taken with probability T(s, a, s') = P(s'|s, a) and yielding reward R(s, a, s')
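An earlier slide noted that expectimax search is one way to solve MDPs. Below is a hedged, depth-limited sketch over the tree structure above: state nodes maximize over actions, and q-state nodes take expectations over transitions. It assumes an `mdp` object like the container sketched earlier, exposing `actions(s)`, `transition(s, a)` returning `{s': P(s'|s, a)}`, `reward(s, a, s')`, and a set `terminals`; these names are mine, not the course's.

```python
def value(mdp, s, depth, gamma=1.0):
    """Max node: best q-value over the actions available in state s."""
    if depth == 0 or s in mdp.terminals:
        return 0.0
    return max(q_value(mdp, s, a, depth, gamma) for a in mdp.actions(s))

def q_value(mdp, s, a, depth, gamma=1.0):
    """Chance node: expectation over transitions (s, a, s'), weighted by T(s, a, s')."""
    return sum(p * (mdp.reward(s, a, s2) + gamma * value(mdp, s2, depth - 1, gamma))
               for s2, p in mdp.transition(s, a).items())
```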
Discounting in MDP

▪ How to discount?
▪ Each time we descend a level, we
multiply in the discount once

▪ Why discount?
▪ Sooner rewards probably do have
higher utility than later rewards
▪ Also helps our algorithms converge

▪ Example: discount γ = 0.5

▪ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
▪ U([3,2,1]) = 1*3 + 0.5*2 + 0.25*1 = 4.25
▪ So U([1,2,3]) < U([3,2,1])
Quiz: Discounting
▪ Given:
▪ States: a (terminal), b, c, d, e (terminal)
▪ Actions: Go east, Go west
▪ Rewards: only at terminal states a and e (no living rewards)
▪ Transitions: deterministic

▪ Quiz 1: For γ = 1, what is the optimal policy?

▪ Quiz 2: For γ = 0.1, what is the optimal policy?

▪ Quiz 3: For which γ are West and East equally good when in state d?
Summary: Defining MDPs
▪ Markov decision processes:
▪ Set of states S
▪ Start state s0
▪ Set of actions A
▪ Transitions P(s'|s,a) (or T(s,a,s'))
▪ Rewards R(s,a,s') (and discount γ)

▪ MDP quantities so far:
▪ Policy = choice of action for each state
▪ Utility = sum of (discounted) rewards
Next
▪ Utility and Rationality
▪ Decision Network
▪ Markov Decision Process
▪ Defining MDPs
▪ Solving MDPs (Value-based)
▪ Solving MDPs (Policy-based)

Search + Probabilities + Time + Actions
