Intro to AI
Example (one value-iteration update for state ⟨3,3⟩):

Q^1(⟨3,3⟩, E) = ∑_{s′ ∈ {⟨3,4⟩, ⟨2,3⟩, ⟨3,3⟩}} P(s′ ∣ ⟨3,3⟩, E) [R(⟨3,3⟩, E, s′) + γ V^0(s′)]

V^1(⟨3,3⟩) = max_{a ∈ {E, W, N, S}} Q^1(⟨3,3⟩, a)
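To make this update concrete, here is a minimal numeric sketch in Python. The transition probabilities, rewards, discount factor, and V^0 values below are assumptions chosen purely for illustration; they are not given in the notes.

```python
# One Q-update for state <3,3> under action E.
# All numbers here are ASSUMED for illustration, not taken from the notes.
gamma = 0.9
V0 = {(3, 4): 1.0, (2, 3): 0.0, (3, 3): 0.0}   # assumed V^0(s') values
P  = {(3, 4): 0.8, (2, 3): 0.1, (3, 3): 0.1}   # assumed P(s' | <3,3>, E)
R  = {(3, 4): 0.0, (2, 3): 0.0, (3, 3): 0.0}   # assumed R(<3,3>, E, s')

# Q^1(<3,3>, E) = sum_{s'} P(s' | <3,3>, E) * [R(<3,3>, E, s') + gamma * V^0(s')]
Q_33_E = sum(P[s2] * (R[s2] + gamma * V0[s2]) for s2 in P)
print(Q_33_E)   # 0.72 with the assumed numbers above
```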
Probabilistic Planning
An action set A is available
For a given action, the mapping to the next state is stochastic (random)
Goal: find the next action to take
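As a small illustration of such a stochastic mapping, the sketch below samples a next state from an assumed outcome distribution (the state names and probabilities are made up):

```python
import random

# Assumed outcome distribution for one (state, action) pair:
# from "s0", the chosen action leads to these next states with these probabilities.
outcomes = {"s1": 0.8, "s0": 0.15, "s2": 0.05}

next_state = random.choices(list(outcomes), weights=list(outcomes.values()))[0]
print(next_state)   # usually "s1", occasionally "s0" or "s2"
```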
Decision Theory
Uncertainty changes the decision-making process
Involves both probability theory (to deal with chances) and utility theory (to deal with consequences)
Find: the action that has the maximum expected utility
Focus on a single decision first, then come back to sequential decisions (MDPs)
1. Outcome probabilities: P(o_j ∣ a_i), the probability of outcome o_j given action a_i
2. Utilities: U(a_i, o_j), the utility of taking action a_i when the outcome is o_j
3. Expected utility: EU(a_i) = ∑_{o_j} P(o_j ∣ a_i) U(a_i, o_j)
4. Take the action with maximum EU: MEU = max_{a_i} EU(a_i)
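A minimal sketch of steps 1-4 in Python. The action names, outcome names, probabilities, and utilities are invented for illustration only:

```python
# Steps 1-4 of the MEU procedure on a made-up two-action example.
P_outcome = {                                     # step 1: P(o_j | a_i)
    "take_umbrella":  {"rain": 0.3, "dry": 0.7},
    "leave_umbrella": {"rain": 0.3, "dry": 0.7},
}
U = {                                             # step 2: U(a_i, o_j)
    ("take_umbrella", "rain"): 70, ("take_umbrella", "dry"): 80,
    ("leave_umbrella", "rain"): 0, ("leave_umbrella", "dry"): 100,
}

def expected_utility(a):                          # step 3: EU(a_i) = sum_j P(o_j | a_i) U(a_i, o_j)
    return sum(p * U[(a, o)] for o, p in P_outcome[a].items())

best_action = max(P_outcome, key=expected_utility)   # step 4: action with maximum EU
print(best_action, expected_utility(best_action))    # take_umbrella 77.0
```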
| Symbol | Meaning |
| --- | --- |
| S | Set of states (e.g., locations, configurations) |
| A | Set of actions the agent can take |
| T | Transition model: T(s, a, s′) = Pr(s′ ∣ s, a), the probability of reaching state s′ from state s after taking action a |
| R | Reward function, e.g. R(s): depends only on the current state |
| γ | Discount factor (0 ≤ γ ≤ 1): how much future rewards are worth compared to immediate ones |
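One way to hold these components in code is with plain Python containers; the two-state MDP below is invented just to show the shape of each component:

```python
# A made-up two-state MDP expressed with the components in the table above.
S = ["s0", "s1"]                                   # set of states
A = ["stay", "go"]                                 # set of actions
T = {                                              # T(s, a, s') = Pr(s' | s, a)
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}
R = {"s0": 0.0, "s1": 1.0}                         # R(s): reward of the current state only
gamma = 0.9                                        # discount factor, 0 <= gamma <= 1
```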
Conditional independence (local Markov property): X_i ⊥ NonDescendants(X_i) ∣ Parents(X_i)
- Terminology:
  - Decision epoch --> the time steps at which decisions are made
  - The horizon can be finite or infinite
  - Terminal state: does not allow any transitions out
MEU vs. MDP

| MEU (single decision) | MDP (sequential) |
| --- | --- |
| EU(s, a) = ∑_{o_j} P(o_j ∣ a) U(a, o_j) | Q(s, a) = ∑_{s′} P(s′ ∣ s, a)[R(s, a, s′) + γ V(s′)] |
| MEU(s) | V(s) |
Markov property: the probability of moving to the next state depends only on the current state and action, not on the full history
Aim: calculate a policy (strategy) that maximises the expected cumulative reward
Value of a policy π: V^π(s) = E[ ∑_{t=0}^{∞} γ^t R(s_t, π(s_t), s_{t+1}) ]
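V^π can be computed by repeatedly applying the expectation above as a fixed-point update (iterative policy evaluation). The sketch below assumes the dict-of-dicts layout for T used earlier and a reward function R(s, a, s′); both are representation choices, not something fixed by the notes:

```python
# Iterative policy evaluation: repeatedly apply
#   V(s) <- sum_{s'} P(s' | s, pi(s)) * [R(s, pi(s), s') + gamma * V(s')]
# which converges to V^pi when gamma < 1.
def evaluate_policy(S, T, R, gamma, pi, n_iters=1000):
    V = {s: 0.0 for s in S}
    for _ in range(n_iters):
        V = {
            s: sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                   for s2, p in T[(s, pi[s])].items())
            for s in S
        }
    return V

# Example with the toy MDP sketched earlier, reading the state-based reward R(s') on arrival:
# V_pi = evaluate_policy(S, T, lambda s, a, s2: R[s2], gamma, pi={"s0": "go", "s1": "stay"})
```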
Q^t(s, a) = ∑_{s′} P(s′ ∣ s, a) [R(s, a, s′) + γ V^{t−1}(s′)]
          = r(s, a) + γ ∑_{s′} P(s′ ∣ s, a) V^{t−1}(s′),

where r(s, a) = ∑_{s′} P(s′ ∣ s, a) R(s, a, s′) is the expected immediate reward.
V^t(s) = max_a Q^t(s, a)
π*(s) = arg max_a Q^t(s, a)
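Putting these equations together gives value iteration. The sketch below assumes the same dict-based T and a reward function R(s, a, s′) as in the earlier sketches:

```python
# Value iteration: compute Q^t and V^t from V^{t-1}, then read off the greedy policy.
def value_iteration(S, A, T, R, gamma, n_iters=100):
    V = {s: 0.0 for s in S}                                   # V^0
    for _ in range(n_iters):
        # Q^t(s, a) = sum_{s'} P(s' | s, a) * [R(s, a, s') + gamma * V^{t-1}(s')]
        Q = {
            (s, a): sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T[(s, a)].items())
            for s in S for a in A
        }
        V = {s: max(Q[(s, a)] for a in A) for s in S}         # V^t(s) = max_a Q^t(s, a)
    policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}  # pi*(s) = argmax_a Q^t(s, a)
    return V, policy
```

In practice the loop is usually stopped when the largest change in V between iterations falls below a tolerance, rather than after a fixed number of sweeps.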