Intro to AI

The document discusses concepts in probabilistic planning and decision theory, emphasizing the calculation of expected utility and the Markov Decision Problem (MDP). It outlines the components of MDP, including states, actions, transition models, and reward functions, and introduces the idea of optimal policies through value iteration. The aim is to maximize expected cumulative rewards while considering the stochastic nature of actions and outcomes.


Week 10

Worked example (one value-iteration update for state ⟨3, 3⟩):

Q^1(⟨3, 3⟩, E) = ∑_{s′ ∈ {⟨3,4⟩, ⟨2,3⟩, ⟨3,3⟩}} P(s′ | ⟨3, 3⟩, E) [R(⟨3, 3⟩, E, s′) + γ V^0(s′)]

V^1(⟨3, 3⟩) = max_{a ∈ {E, W, N, S}} Q^1(⟨3, 3⟩, a)

Probabilistic Planning
An action set, A, is available
For a given action, the mapping to outcomes is stochastic (random)
Goal: find the next action to take

Decision Theory
Uncertainty changes the decision-making process
Involves both probability theory (dealing with chances) and utility theory (dealing with consequences)
Find:
the action that has the maximum Expected Utility
Focus on a single decision first, then come back to sequential decisions (MDPs)

Maximum Expected Utility


Let A = {a_1, a_2, …} be the set of actions and O = {o_1, o_2, …} be the set of outcomes
When the agent is faced with a single decision:

1. Compute the probability of each outcome given action a_i:

P(o_j | a_i)

2. Compute the reward (utility) of each outcome after taking a_i:

U(a_i, o_j)

3. Compute the Expected Utility:

EU(a_i) = ∑_{o_j} P(o_j | a_i) U(a_i, o_j)

4. Take the action with maximum EU:

MEU = max_{a_i} EU(a_i)
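
As a concrete illustration, here is a minimal Python sketch of steps 1–4. The umbrella decision, its outcome probabilities, and its utilities are invented for this example; they are not from the notes.

```python
# Minimal sketch of the four MEU steps above.
# The actions, outcome probabilities, and utilities are invented for illustration.

# Step 1: P(o_j | a_i) -- probability of each outcome given the action
P = {
    "take_umbrella":  {"dry": 1.0, "wet": 0.0},
    "leave_umbrella": {"dry": 0.6, "wet": 0.4},
}

# Step 2: U(a_i, o_j) -- utility of each (action, outcome) pair
U = {
    ("take_umbrella", "dry"): 70,   ("take_umbrella", "wet"): 70,
    ("leave_umbrella", "dry"): 100, ("leave_umbrella", "wet"): 0,
}

# Step 3: EU(a_i) = sum over o_j of P(o_j | a_i) * U(a_i, o_j)
def expected_utility(a):
    return sum(p * U[(a, o)] for o, p in P[a].items())

# Step 4: take the action with maximum EU
best_action = max(P, key=expected_utility)
print(best_action, expected_utility(best_action))   # take_umbrella 70.0
```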

Markov Decision Problem


Defined by a 5-tuple ⟨S, A, T, R, γ⟩:

S: set of states (e.g., locations, configurations)
A: set of actions the agent can take
T: transition model, T(s, a, s′) = Pr(s′ | s, a), the probability of reaching state s′ from state s after taking action a
R: reward function, which may take one of these forms:
  R(s): only the current state
  R(s, a): the current state and action
  R(s, a, s′): the state, action, and next state
γ: discount factor (0 ≤ γ ≤ 1), i.e. how much future rewards are worth compared to immediate ones
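
For concreteness, the 5-tuple can be written down directly as data. A minimal Python sketch, assuming an invented two-state MDP (it is not the example used in the lecture):

```python
# The MDP 5-tuple written as plain Python data.
# This tiny two-state MDP is invented for illustration.

S = ["s0", "s1"]             # set of states
A = ["stay", "go"]           # set of actions

# Transition model: T[(s, a)] = {s': Pr(s' | s, a)}
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},   # "go" sometimes fails
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.8, "s1": 0.2},
}

# Reward function of the R(s, a, s') form: reward only for reaching s1
def R(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0

gamma = 0.9                  # discount factor, 0 <= gamma <= 1
```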

Satisfies the Markov property

Local Markov Property:

A random variable X_i is independent of its non-descendants given all (and only) its parents:

X_i ⊥ Non-Descendants(X_i) | Parents(X_i)

- Terminology:
  - Decision epoch: the steps at which decisions are made; the horizon can be finite or infinite
  - Terminal state: a state that does not allow any transitions out
MDP vs MEU:

EU(s, a) = ∑_{o_j} P(o_j | a) U(a, o_j)    ↔    Q(s, a) = ∑_{s′} P(s′ | s, a) [R(s, a, s′) + γ V(s′)]

MEU(s) = max_a EU(s, a)    ↔    V(s) = max_a Q(s, a)
Contextually, the probability of moving to the next state depends only on the current state and action, not the full history
Aim: calculate a policy (strategy) that maximises the expected cumulative reward

Value of a policy π: V^π(s) = E[ ∑_{t=0}^{∞} γ^t R(s_t, π(s_t), s_{t+1}) ]
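
One way to make this expectation concrete is to estimate V^π(s) by averaging discounted returns over sampled trajectories (a Monte Carlo rollout estimate; this estimation approach is an illustration, not part of the notes). The MDP and policy below are the same invented toy example:

```python
import random

# Monte Carlo estimate of V^pi(s) = E[ sum_t gamma^t * R(s_t, pi(s_t), s_{t+1}) ].
# The MDP, the policy, and all numbers are invented for illustration.

gamma = 0.9
T = {
    ("s0", "go"): {"s1": 0.8, "s0": 0.2},
    ("s1", "go"): {"s0": 0.8, "s1": 0.2},
}
R = lambda s, a, s_next: 1.0 if s_next == "s1" else 0.0
pi = lambda s: "go"          # a fixed (stationary, deterministic) policy

def rollout(s, horizon=100):
    """Sample one trajectory and return its discounted cumulative reward."""
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)
        next_states, probs = zip(*T[(s, a)].items())
        s_next = random.choices(next_states, probs)[0]
        ret += discount * R(s, a, s_next)
        discount *= gamma
        s = s_next
    return ret

# Average many rollouts to approximate the expectation
print(sum(rollout("s0") for _ in range(5000)) / 5000)
```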

Unlike a traditional plan, a policy is not just a sequence of actions: it maximises expected utility over all states
Types:
Stationary policy: the same rule is applied at every decision epoch
Non-stationary policy: the rule changes over decision epochs
Deterministic policy: every rule maps a state to exactly one action with absolute certainty

Optimal Policy: Value Iteration

Initialise V^0(s) = 0 for all states
Repeat the following computations until convergence (i.e., until V^t(s) and V^{t-1}(s) are very close for every state):

Q^t(s, a) = ∑_{s′} P(s′ | s, a) [R(s, a, s′) + γ V^{t-1}(s′)]
          = r(s, a) + γ ∑_{s′} P(s′ | s, a) V^{t-1}(s′),   where r(s, a) = ∑_{s′} P(s′ | s, a) R(s, a, s′)

V^t(s) = max_a Q^t(s, a)

π*(s) = arg max_a Q^t(s, a)

For an infinite horizon, the optimal policy is stationary
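
A compact Python sketch of the loop above, run on the invented two-state MDP from earlier (not the lecture's example); the convergence threshold theta is an arbitrary choice:

```python
# Value iteration: repeat the Q^t / V^t updates until V stops changing.
# The two-state MDP and the threshold theta are invented for illustration.

S = ["s0", "s1"]
A = ["stay", "go"]
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.8, "s1": 0.2},
}
R = lambda s, a, s_next: 1.0 if s_next == "s1" else 0.0
gamma, theta = 0.9, 1e-6

V = {s: 0.0 for s in S}                                   # V^0(s) = 0 for all states
while True:
    # Q^t(s, a) = sum over s' of P(s' | s, a) * [R(s, a, s') + gamma * V^{t-1}(s')]
    Q = {(s, a): sum(p * (R(s, a, s2) + gamma * V[s2])
                     for s2, p in T[(s, a)].items())
         for s in S for a in A}
    V_new = {s: max(Q[(s, a)] for a in A) for s in S}     # V^t(s) = max_a Q^t(s, a)
    converged = max(abs(V_new[s] - V[s]) for s in S) < theta
    V = V_new
    if converged:
        break

# pi*(s) = argmax_a Q^t(s, a)
policy = {}
for s in S:
    policy[s] = max(A, key=lambda a: Q[(s, a)])
print(V, policy)
```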
