(24F-COSE361) 5. Markov Decision Process

The document discusses Markov Decision Processes (MDPs) within the context of artificial intelligence, focusing on rational decision-making and the structure of decision networks. It outlines the components of MDPs, including states, actions, transition functions, and reward functions, and emphasizes the importance of maximizing expected utility. Additionally, it explores the concept of policies in MDPs and how they can be used to determine optimal actions in uncertain environments.

COSE361: Artificial Intelligence

Markov Decision Processes (1)

[Adapted from the original slides from CS188 Intro to AI at UC Berkeley]


Outline
▪ Rational Decision
▪ Decision Network
▪ Markov Decision Process
RECAP: What should we build?
Machines with artificial intelligence should …

▪ Think like people?
▪ Cognitive science / neuroscience → natural intelligence?
▪ Think rationally?
▪ Laws of the thought process → unknowns and uncertainties?
▪ Act like people?
▪ Turing test (Alan Turing) → mistakes with a human touch?
▪ Act rationally?
▪ An agent (i.e., something that acts) → to achieve the best outcome!
RECAP: In a nutshell
▪ Artificial Intelligence has focused on the study and construction
of rational agents that do the right thing.
▪ The right thing is defined by the objective that we provide to the agent.
Recap: Rational Decisions
▪ We’ll use the term rational in a very specific, technical way:
▪ Rational: maximally achieving pre-defined goals
▪ Goals are expressed in terms of the utility of outcomes
▪ Utilities describe an agent’s preferences for the outcomes (states of the world)
▪ Utilities are quantified as functions of outcomes → real numbers
Recap: Rational Decisions
▪ We’ll use the term rational in a very specific, technical way:
▪ Rational: maximally achieving pre-defined goals
▪ Goals are expressed in terms of the utility of outcomes
▪ World is uncertain, so we may want to use expected utility
▪ Being rational means acting to maximize your expected utility
Utilities

▪ Utility = a function F from outcomes (states of the world) to real numbers (rewards or incentives)


▪ Utilities summarize the agent’s preferences on outcomes
▪ In a game, utility may be simple scores (e.g., 300 pts or -100 pts)
▪ Theorem: any “rational” preferences can be summarized as a utility function

▪ We preset utilities and let behaviors emerge


▪ Why don’t we prescribe behaviors or let agents pick utilities?
▪ Safety, adaptability, and practicality
Utilities with Uncertain Outcomes
[Figure: Getting ice cream — the agent chooses Get Single or Get Double; the double scoop may fall (Oops) or not (Whew!)]
Preferences

▪ Notation
▪ Outcomes: A, B
▪ Certain outcome: A
▪ Uncertain outcome (a lottery): [p, A; (1-p), B], i.e., A with probability p and B with probability 1-p
▪ Preference: A ≻ B
▪ Indifference: A ∼ B

▪ An agent must have preferences among outcomes

▪ e.g., preferences over ice-cream prizes/lotteries
▪ Certain: A = Single, B = Double
▪ Uncertain: the lottery [p, Single; (1-p), Double]
Rational Preferences
▪ Axiom of rationality
▪ We need some constraints on preferences to call them rational

▪ Counterexample
▪ An agent with intransitive preferences can be exploited into repeated losses
▪ e.g., Rock Paper Scissors
Maximum Expected Utility (MEU)
▪ Rational preferences
▪ Imply behaviors describable as maximization of expected utility

▪ Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]


▪ Given any preferences satisfying these axioms, there exists a real-valued function U such that:
▪ U(A) ≥ U(B) ⇔ A ⪰ B, and U([p1, S1; … ; pn, Sn]) = Σ_i p_i U(S_i)

▪ i.e., values assigned by U preserve preferences over both certain and uncertain outcomes

▪ Principle of Maximum Expected Utility (MEU)


▪ A rational agent should choose the action that maximizes its expected utility, given its knowledge
▪ Note: An agent can be rational (consistent with MEU) without being aware of utilities and probabilities
▪ e.g., A reflex agent can be rational - a lookup table for perfect tic-tac-toe, a reflex vacuum cleaner
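To make the MEU principle concrete, here is a minimal sketch (not from the slides) of choosing the action with the highest expected utility; the action names, outcome distributions, and utility values below are hypothetical placeholders.

```python
def expected_utility(action, outcome_probs, utility):
    """EU(a) = sum over outcomes s of P(s | a) * U(s)."""
    return sum(p * utility[s] for s, p in outcome_probs[action].items())

def meu_action(actions, outcome_probs, utility):
    """Principle of MEU: pick the action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, outcome_probs, utility))

# Hypothetical example: two actions over two possible outcomes.
utility = {"good": 10.0, "bad": -5.0}
outcome_probs = {
    "risky": {"good": 0.6, "bad": 0.4},   # EU = 0.6*10 - 0.4*5 = 4.0
    "safe":  {"good": 0.5, "bad": 0.5},   # EU = 0.5*10 - 0.5*5 = 2.5
}
print(meu_action(["risky", "safe"], outcome_probs, utility))  # -> risky
```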
Human Utilities?
▪ Let’s do a poll! - famous example of Allais (1953)
Utilities of Sequences
▪ What preferences should an agent have over reward sequences?
▪ e.g., A sequence of rewards collected up to time t: [r0, r1, r2, … rt]

▪ More or less? [1, 2, 3] or [2, 3, 4]

▪ Now or later? [0, 0, 1] or [1, 0, 0]

▪ What about this? [1, 2, 3] or [0, 0, 8]


Discounting Utilities
▪ It’s reasonable to maximize the sum of rewards
▪ It’s also reasonable to prefer rewards now to rewards later

▪ One solution: values of rewards decay exponentially

A reward r is worth r now, γ·r one step later, and γ²·r two steps later


Stationary Preferences
▪ Stationarity: if [a, r1, r2, …] ≻ [a, r1', r2', …], then [r1, r2, …] ≻ [r1', r2', …]
▪ To preserve stationary preferences, we adopt additive discounted utility:

U([r0, r1, r2, …]) = r0 + γ·r1 + γ²·r2 + …, where γ ∈ (0, 1] is the discount factor
▪ γ = 1: additive utility
▪ γ < 1: discounted utility
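As a quick illustration of the additive discounted utility formula above, the following sketch (assuming the reward sequence is given as a Python list) sums rewards weighted by powers of γ.

```python
def discounted_utility(rewards, gamma=1.0):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], gamma=1.0))  # additive utility: 1 + 2 + 3 = 6
print(discounted_utility([1, 2, 3], gamma=0.5))  # discounted: 1 + 1.0 + 0.75 = 2.75
```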
Outline
▪ Rational Decision
▪ Decision Network
▪ Markov Decision Process
Decision Network

[Figure: decision network — an action node Umbrella (leave / take), chance nodes Weather (sun / rain) and Forecast (bad / good), and a utility node U depending on Umbrella and Weather]
Decision Network
▪ Maximum Expected Utility (MEU)
▪ Choose the action that maximizes the expected utility given the evidence

▪ Decision networks can be used to work with MEU


▪ Bayes nets with nodes for utility and actions
▪ Calculate the expected utility for each action

▪ New node types:
▪ Chance node
▪ Circle, just like random variables in Bayes nets
▪ Action node
▪ Rectangle, acts as observed evidence (but selected by the agent)
▪ Utility node
▪ Diamond, depends on action and chance nodes
Decision Network
▪ Procedure of action selection with MEU
▪ Instantiate all evidence
▪ Set the action node(s) each possible way
▪ Calculate posteriors for all parents of the utility node, given the evidence
▪ Calculate the expected utility for each action
▪ Choose the action maximizing the expected utility
Decision Network Example (Simple)
▪ EU(action) = Expected Utility of taking the action
▪ MEU(evidence = ø) = Maximum Expected Utility, given no evidence (i.e., the empty evidence set)

Utility U(A, W):
A      W     U(A,W)
leave  sun   100
leave  rain  0
take   sun   20
take   rain  70

Prior over weather:
W     P(W)
sun   0.7
rain  0.3

Optimal decision = leave
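A small sketch of the expected-utility computation behind this example, using the U(A, W) and P(W) tables above (the code layout and names are illustrative, not the course's code).

```python
# Utility table U(A, W) and weather prior P(W) from the slide above.
U = {("leave", "sun"): 100, ("leave", "rain"): 0,
     ("take", "sun"): 20, ("take", "rain"): 70}
P_W = {"sun": 0.7, "rain": 0.3}

def eu(action, p_weather):
    """EU(a) = sum over w of P(w) * U(a, w)."""
    return sum(p * U[(action, w)] for w, p in p_weather.items())

# EU(leave) = 0.7*100 + 0.3*0 = 70
# EU(take)  = 0.7*20  + 0.3*70 = 35
best = max(("leave", "take"), key=lambda a: eu(a, P_W))
print(best)  # -> leave, so MEU(no evidence) = 70
```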
Decisions as Outcome Trees
[Figure: outcome tree with no evidence {} — the root chooses Umbrella (take or leave); each choice leads to a chance node over Weather | {}, ending in leaves U(t,s), U(t,r), U(l,s), U(l,r)]
▪ Almost exactly like expectimax, but… what’s changed?


▪ Computation of probabilities at each uncertain node
Decision Network Example (with Evidence)
▪ EU(action | evidence) = Expected Utility of taking the action, given the evidence
▪ MEU(evidence) = Maximum Expected Utility, given the evidence

Utility U(A, W):
A      W     U(A,W)
leave  sun   100
leave  rain  0
take   sun   20
take   rain  70

Posterior given the forecast:
W     P(W|F=bad)
sun   0.34
rain  0.66

Optimal decision = take if Forecast = bad
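The same kind of sketch as before, but with expectations taken under the posterior P(W | F=bad) from the table above (again, the code is illustrative rather than the course's).

```python
# Same utility table, but expectations now use P(W | Forecast = bad).
U = {("leave", "sun"): 100, ("leave", "rain"): 0,
     ("take", "sun"): 20, ("take", "rain"): 70}
P_W_bad = {"sun": 0.34, "rain": 0.66}

def eu(action, p_weather):
    return sum(p * U[(action, w)] for w, p in p_weather.items())

# EU(leave | F=bad) = 0.34*100 + 0.66*0  = 34
# EU(take  | F=bad) = 0.34*20  + 0.66*70 = 53
best = max(("leave", "take"), key=lambda a: eu(a, P_W_bad))
print(best)  # -> take, so MEU(F=bad) = 53
```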
Decisions as Outcome Trees
[Figure: outcome tree given evidence {b} (Forecast = bad) — the root chooses Umbrella; each choice leads to a chance node over Weather | {b}, ending in leaves U(t,s), U(t,r), U(l,s), U(l,r)]
▪ Almost exactly like expectimax, but… what’s changed?
▪ Computation of probabilities at each uncertain node
Value of Perfect Information (VPI)
▪ MEU(e) = Maximum Expected Utility, given the known evidence E=e
▪ We assume that the evidence e is known (i.e., nodes we know)
▪ Calculating MEU requires taking a maximum over several expectations (i.e., one EU per action)

▪ VPI(E'|e) = Expected gain in utility from knowing a new variable E', given the evidence e known so far
▪ E': the random variable(s) whose value we want to learn (i.e., new evidence to reveal)
▪ e: the random variable(s) whose value we already know (i.e., the known evidence)
▪ Calculating VPI requires taking an expectation over several MEUs
▪ i.e., one MEU per possible outcome of E', because we don’t know the value of E'
Value of Perfect Information (VPI)
▪ Assume we have evidence E = e. Value of MEU if we act now:
  MEU(e) = max_a Σ_s P(s | e) U(s, a)

▪ Assume we then also observe E' = e'. Value of MEU if we act after seeing it:
  MEU(e, e') = max_a Σ_s P(s | e, e') U(s, a)

▪ But E' is a random variable whose value is unknown
▪ We don't know what e' will be

▪ Expected value of MEU if E' is revealed and then we act:
  MEU(e, E') = Σ_e' P(e' | e) MEU(e, e')

▪ Value of perfect information:
▪ How much MEU goes up by revealing E' first and then acting, over acting now:
  VPI(E' | e) = MEU(e, E') − MEU(e)
VPI Example
Utility U(A, W):
A      W     U(A,W)
leave  sun   100
leave  rain  0
take   sun   20
take   rain  70

Weather distributions:
W     P(W)   P(W|F=bad)   P(W|F=good)
sun   0.7    0.34         0.95
rain  0.3    0.66         0.05

Forecast prior:
F      P(F)
good   0.59
bad    0.41

▪ MEU with no evidence
▪ MEU if Forecast is bad (optimal decision = take)
▪ MEU if Forecast is good (optimal decision = leave)
▪ MEU with evidence (Forecast revealed)
▪ Value of perfect information
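Putting the pieces together, here is a worked sketch of the VPI computation for this example, plugging in the tables above (the function and variable names are mine, not the course's).

```python
# Utility table and distributions from the tables above.
U = {("leave", "sun"): 100, ("leave", "rain"): 0,
     ("take", "sun"): 20, ("take", "rain"): 70}

def meu(p_weather):
    """Maximum over actions of the expected utility under the given weather distribution."""
    return max(sum(p * U[(a, w)] for w, p in p_weather.items())
               for a in ("leave", "take"))

meu_now  = meu({"sun": 0.70, "rain": 0.30})   # 70 (leave)
meu_bad  = meu({"sun": 0.34, "rain": 0.66})   # 53 (take)
meu_good = meu({"sun": 0.95, "rain": 0.05})   # 95 (leave)

p_forecast = {"good": 0.59, "bad": 0.41}
meu_with_forecast = p_forecast["good"] * meu_good + p_forecast["bad"] * meu_bad  # ~ 77.8

vpi_forecast = meu_with_forecast - meu_now    # ~ 7.8: expected gain from seeing the forecast
print(round(vpi_forecast, 2))
```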


Outline
▪ Rational Decision
▪ Decision Network
▪ Markov Decision Process
▪ Defining MDPs
▪ Solving MDPs (Value-based)
▪ Solving MDPs (Policy-based)

Search + Probabilities + Time + Actions


Markov Decision Process
[Figure: three graphical models compared —
Markov model: a chain of states X0 → X1 → X2 → X3;
Decision network: an action node A and a chance node X feeding a utility node U;
Markov decision process: a chain of states X0 → X1 → X2 → X3 in which each transition is driven by an action (A0, A1, A2) and produces a reward (R0, R1, R2)]
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path

▪ Noisy movement
▪ Actions do not always go as planned
▪ 80% of the time, North → North
▪ 10% of the time, North → West or North → East
▪ If there is a wall in the direction of movement, the agent stays put

▪ The agent receives rewards each time step


▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)

▪ Goal is to maximize sum of rewards
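A minimal sketch of the noisy movement model described above (the coordinate convention, direction encoding, and wall set are illustrative assumptions, not the course's code): 80% of the time the agent moves in the intended direction, 10% in each of the two perpendicular directions, and it stays put when a wall blocks the move.

```python
NORTH, SOUTH, EAST, WEST = (0, 1), (0, -1), (1, 0), (-1, 0)
PERPENDICULAR = {NORTH: (WEST, EAST), SOUTH: (WEST, EAST),
                 EAST: (NORTH, SOUTH), WEST: (NORTH, SOUTH)}

def transition(state, action, walls):
    """Return {next_state: probability} for the noisy grid-world move."""
    result = {}
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]
    for direction, prob in outcomes:
        nxt = (state[0] + direction[0], state[1] + direction[1])
        if nxt in walls:              # blocked by a wall: the agent stays put
            nxt = state
        result[nxt] = result.get(nxt, 0.0) + prob
    return result

print(transition((1, 1), NORTH, walls={(1, 2)}))
# -> {(1, 1): 0.8, (0, 1): 0.1, (2, 1): 0.1}
```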


Example: Grid World
Deterministic Grid World vs. Stochastic Grid World
[Figure: in the deterministic grid world, the action Up always moves the agent up; in the stochastic grid world, Up succeeds with probability 0.8 and slips Left or Right with probability 0.1 each]
Markov Decision Process
▪ An MDP is defined by:
▪ A set of states s ∈ S
▪ A set of actions a ∈ A
▪ A transition function T(s, a, s')
▪ Probability that a from s leads to s', i.e., P(s' | s, a)
▪ Also called the model or the dynamics
▪ A reward function R(s, a, s')
▪ Sometimes just R(s) or R(s')
▪ A start state
▪ A terminal state (optional)

▪ Action outcomes depend only on the current state (i.e., the Markov property)

▪ MDPs are non-deterministic search problems
▪ One way to solve them is with expectimax search
▪ We'll have a new tool soon
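One possible way to write the (S, A, T, R) definition above down in code is as a small container of the MDP ingredients. This is only a sketch of an interface (the field names are mine); a concrete problem such as the grid world would supply the actual functions.

```python
from typing import Any, Callable, Dict, Iterable, NamedTuple

State = Any
Action = Any

class MDP(NamedTuple):
    """Container for the ingredients that define an MDP."""
    states: Iterable[State]                                     # set of states S
    actions: Callable[[State], Iterable[Action]]                # available actions A(s)
    transition: Callable[[State, Action], Dict[State, float]]   # T(s, a) -> {s': P(s'|s,a)}
    reward: Callable[[State, Action, State], float]             # R(s, a, s')
    start: State                                                # start state
    terminals: frozenset = frozenset()                          # optional terminal states
```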
Policies
▪ Deterministic single-agent search problems
▪ Find an optimal plan, or sequence of actions, from start to a goal

▪ For MDPs, which are non-deterministic search problems, we want an optimal policy π*: S → A
▪ A policy π gives an action for each state
▪ An optimal policy is one that maximizes expected utility (MEU) if followed
▪ An explicit policy defines a reflex agent (see the sketch after this slide)

▪ Note: expectimax didn't compute entire policies
▪ It computes the action for a single state only

[Figure: example policy on the grid world]
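Since an explicit policy defines a reflex agent, it can be represented as nothing more than a lookup from states to actions; the grid coordinates and action names below are illustrative placeholders.

```python
# A policy as a simple mapping from states to actions (illustrative values).
policy = {
    (0, 0): "North",
    (0, 1): "North",
    (0, 2): "East",
    (1, 2): "East",
    (2, 2): "East",
}

def act(state):
    """A policy-following agent is a reflex agent: look up the action for the state."""
    return policy[state]

print(act((0, 1)))  # -> North
```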


Example: Policies in Grid World?

▪ 80% of the time, North → North


▪ 10% of the time, North → West
▪ 10% of the time, North → East
Example: Optimal Policies in Grid World

▪ 80% of the time, North → North


▪ 10% of the time, North → West
▪ 10% of the time, North → East

[Figure: optimal policies for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0; a maximum time step exists]
Example: Optimal Policies in Grid World?

▪ 80% of the time, North → North


▪ 10% of the time, North → West
▪ 10% of the time, North → East

The living reward R(s) > 0?


MDP Search Trees
▪ Each MDP state projects an expectimax-like search tree
▪ s: a state
▪ (s, a): a q-state, reached after committing to action a in state s
▪ (s, a, s'): a transition, taken with probability T(s, a, s') = P(s'|s, a) and yielding reward R(s, a, s')
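An earlier slide noted that expectimax search is one way to solve MDPs. Below is a hedged, depth-limited sketch over the tree structure above: state nodes maximize over actions, and q-state nodes take expectations over transitions. It assumes an `mdp` object like the container sketched earlier, exposing `actions(s)`, `transition(s, a)` returning `{s': P(s'|s, a)}`, `reward(s, a, s')`, and a set `terminals`; these names are mine, not the course's.

```python
def value(mdp, s, depth, gamma=1.0):
    """Max node: best q-value over the actions available in state s."""
    if depth == 0 or s in mdp.terminals:
        return 0.0
    return max(q_value(mdp, s, a, depth, gamma) for a in mdp.actions(s))

def q_value(mdp, s, a, depth, gamma=1.0):
    """Chance node: expectation over transitions (s, a, s'), weighted by T(s, a, s')."""
    return sum(p * (mdp.reward(s, a, s2) + gamma * value(mdp, s2, depth - 1, gamma))
               for s2, p in mdp.transition(s, a).items())
```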
Discounting in MDP

▪ How to discount?
▪ Each time we descend a level, we
multiply in the discount once

▪ Why discount?
▪ Sooner rewards probably do have
higher utility than later rewards
▪ Also helps our algorithms converge

▪ Example: discount γ = 0.5

▪ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
▪ U([3,2,1]) = 1*3 + 0.5*2 + 0.25*1 = 4.25
▪ So U([1,2,3]) < U([3,2,1])
Quiz: Discounting
▪ Given:
▪ States: a (terminal), b, c, d, e (terminal)
▪ Actions: Go east, Go west
▪ Rewards: only at terminal states a and e (no living rewards)
▪ Transitions: deterministic

▪ Quiz 1: For γ = 1, what is the optimal policy?

▪ Quiz 2: For γ = 0.1, what is the optimal policy?

▪ Quiz 3: For which γ are West and East equally good when in state d?
Summary: Defining MDPs
▪ Markov decision processes:
▪ Set of states S
▪ Start state s0
▪ Set of actions A
▪ Transitions P(s'|s,a) (or T(s,a,s'))
▪ Rewards R(s,a,s') (and discount γ)

▪ MDP quantities so far:
▪ Policy = choice of action for each state
▪ Utility = sum of (discounted) rewards
Next
▪ Utility and Rationality
▪ Decision Network
▪ Markov Decision Process
▪ Defining MDPs
▪ Solving MDPs (Value-based)
▪ Solving MDPs (Policy-based)

Search + Probabilities + Time + Actions
