Basics RL

Agents & Environments

Outline
• Agents and environments
• Rationality
• PEAS specification
• Environment types
• Agent types

Agents
• An agent is anything that can be viewed as perceiving
its environment through sensors and acting upon
that environment through actuators

• Human agent:
– eyes, ears, and other organs for sensors
– hands, legs, mouth, and other body parts for actuators

• Robotic agent:
– cameras and laser range finders for sensors
– various motors for actuators

Examples of Agents
• Physically grounded agents
– Intelligent buildings
– Autonomous spacecraft

• Softbots
– Expert Systems
– IBM Watson
Intelligent Agents
• Have sensors, effectors
• Implement a mapping from percept sequences to actions
• Performance measure

(Diagram: percepts flow from the environment to the agent; actions flow from the agent back to the environment.)
Rational Agents
• An agent should strive to do the right thing, based on what it can
perceive and the actions it can perform. The right action is the one
that will cause the agent to be most successful

• Performance measure: an objective criterion for success of an
agent's behavior

• E.g., the performance measure of a vacuum-cleaner agent could be the
amount of dirt cleaned up, amount of time taken, amount of electricity
consumed, amount of noise generated, etc.
Ideal Rational Agent
“For each possible percept sequence, does
whatever action is expected to maximize its
performance measure on the basis of evidence
perceived so far and built-in knowledge.''

• Rationality vs omniscience?
• Acting in order to obtain valuable information
Bounded Rationality
• We have a performance measure to optimize
• Given our state of knowledge
• Choose optimal action
• Given limited computational resources
PEAS: Specifying Task Environments
• PEAS: Performance measure, Environment, Actuators,
Sensors

• Must first specify the setting for intelligent agent design

• Example: the task of designing an automated taxi driver:
– Performance measure
– Environment
– Actuators
– Sensors
PEAS
• Agent: Automated taxi driver

• Performance measure:
– Safe, fast, legal, comfortable trip, maximize profits
• Environment:
– Roads, other traffic, pedestrians, customers
• Actuators:
– Steering wheel, accelerator, brake, signal, horn

• Sensors:
– Cameras, sonar, speedometer, GPS, odometer, engine sensors,
keyboard

PEAS
• Agent: Medical diagnosis system
• Performance measure:
– Healthy patient, minimize costs, lawsuits

• Environment:
– Patient, hospital, staff

• Actuators:
– Screen display (questions, tests, diagnoses, treatments,
referrals)
• Sensors:
– (entry of symptoms, findings, patient's answers)

Properties of Environments
• Observability: full vs. partial vs. non

• Deterministic vs. stochastic

• Episodic vs. sequential

• Static vs. Semi-dynamic vs. dynamic

• Discrete vs. continuous

• Single Agent vs. Multi Agent (Cooperative, Competitive, Self-Interested)


RoboCup vs. Chess

Deep Blue (Chess) vs. RoboCup Robot

• Static/Semi-dynamic vs. Dynamic
• Deterministic vs. Stochastic
• Observable vs. Partially observable
• Discrete vs. Continuous
• Sequential vs. Sequential
• Multi-Agent vs. Multi-Agent
More Examples
• 15-Puzzle
– Static – Deterministic – Fully Obs – Discrete – Seq – Single

• Poker
– Static – Stochastic – Partially Obs – Discrete – Seq – Multi-agent

• Medical Diagnosis
– Dynamic – Stochastic – Partially Obs – Continuous – Seq – Single

• Taxi Driving
– Dynamic – Stochastic – Partially Obs – Continuous – Seq – Multi-agent
Simple reflex agents
(Diagram: sensors report "what the world is like now"; condition/action rules pick "what action should I do now"; effectors act on the environment.)
Reflex agent with internal state
(Diagram: the agent also keeps internal state: "what the world was like" plus a model of "how the world evolves" are combined with the current percept to estimate "what the world is like now" before the condition/action rules choose an action.)
Goal-based agents
(Diagram: the agent additionally models "what my actions do", predicts "what it'll be like if I do actions A1-An", and uses its goals to decide "what action should I do now".)
Utility-based agents
(Diagram: as in the goal-based agent, the agent predicts "what it'll be like if I do actions A1-An", but a utility function now answers "how happy would I be?", and the action with the best expected utility is chosen.)
Learning agents
(Diagram: same architecture, but feedback from the environment is used to learn how the world evolves, learn what the agent's actions do, and learn the utility function itself.)
Fundamentals of Decision Theory
Uncertainty in AI

• Uncertainty comes in many forms


– uncertainty due to another agent’s policy
– uncertainty in outcome of my own action
– uncertainty in my knowledge of the world
– uncertainty in how the world evolves
Decision Theory
• is the study of an agent's choices
• lays out principles for how an agent arrives at an optimal choice
– Good decisions may occasionally have unexpected bad outcomes;
they are still good decisions if made properly
– Bad decisions may occasionally have good outcomes if you are lucky;
they are still bad decisions
Steps in Decision Theory
1. List the possible actions (actions/decisions)

2. Identify the possible outcomes

3. List the payoff or profit or reward

4. Select one of the decision theory models

5. Apply the model and make your decision


Running Example
• Problem.
– The FoxPhone India Co. must decide whether or not to
expand its product line by manufacturing
indigenous smartphones in India

• Step 1: List the possible alternatives/actions

alternative: "a course of action or strategy that may be chosen by
the decision maker"
– (1) Construct a large plant to manufacture the
phones
– (2) Construct a small plant
– (3) Do nothing
The FoxPhone India Co.
• Step 2: Identify the states of nature
– Market for indigenous smartphones could be favorable
• high demand
– Market for indigenous smartphones could be unfavorable
• low demand

state of nature: "an outcome over which the agent has little or no
control"
e.g., lottery, coin-toss, whether it will rain today
The FoxPhone India Co.
• Step 3: List the possible rewards
– A reward for all possible combinations of actions and
states of nature
– Conditional values: “reward depends upon the
action and the state of nature”
• with a favorable market:
– a large plant produces a net profit of ₹200,000
– a small plant produces a net profit of ₹100,000
– no plant produces a net profit of ₹0
• with an unfavorable market:
– a large plant produces a net loss of ₹180,000
– a small plant produces a net loss of ₹20,000
– no plant produces a net profit of ₹0
Reward Tables

• A means of organizing a decision situation, including the rewards
from different situations given the possible states of nature
(rows: actions; columns: states of nature)
The FoxPhone India Co.

States of Nature
Actions Favorable Market Unfavorable Market
Large plant ₹200,000 -₹180,000
Small plant ₹100,000 -₹20,000
No plant ₹0 ₹0
The FoxPhone India Co.
• Steps 4/5: Select an appropriate model and apply it
– Model selection depends on the operating environment and degree
of uncertainty
Future Uncertainty
• Nondeterministic

• Probabilistic
Non-deterministic Uncertainty

States of Nature
Actions Favorable Market Unfavorable Market
Large plant ₹200,000 -₹180,000
Small plant ₹100,000 -₹20,000
No plant ₹0 ₹0

• What should we do?


Maximax Criterion
“Go for the Gold”

• Select the decision that results in the maximum of the maximum
rewards

• A very optimistic decision criterion
– Decision maker assumes that the most favorable state of nature
for each action will occur

• Most risk-prone agent


Maximax
States of Nature
Decision Favorable Unfavorable Maximum in Row

Large plant ₹200,000 -₹180,000 ₹200,000
Small plant ₹100,000 -₹20,000 ₹100,000
No plant ₹0 ₹0 ₹0

• FoxPhone India Co. assumes that the most favorable state of nature
occurs for each decision action
• Select the maximum reward for each decision
– All three maximums occur if a favorable economy prevails (a tie in
case of no plant)
• Select the maximum of the maximums
– Maximum is ₹200,000; corresponding decision is to build the large
plant
– Potential loss of ₹180,000 is completely ignored
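A minimal Python sketch of the maximax rule on this reward table; the dictionary layout and variable names below are illustrative, not from the slides:

```python
# Reward table: action -> (favorable-market reward, unfavorable-market reward), in ₹
rewards = {
    "large plant": (200_000, -180_000),
    "small plant": (100_000, -20_000),
    "no plant":    (0, 0),
}

# Maximax: best-case reward of each action, then the action whose best case is largest.
row_max = {a: max(r) for a, r in rewards.items()}
print(row_max)                        # {'large plant': 200000, 'small plant': 100000, 'no plant': 0}
print(max(row_max, key=row_max.get))  # 'large plant'
```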
Maximin Criterion
“Best of the Worst”

• Select the decision that results in the maximum of the minimum
rewards

• A very pessimistic decision criterion
– Decision maker assumes that the minimum reward occurs for each
decision action
– Select the maximum of these minimum rewards

• Most risk-averse agent


Maximin
States of Nature
Decision Favorable Unfavorable Minimum in Row

Large plant ₹200,000 -₹180,000 -₹180,000


Small plant ₹100,000 -₹20,000 -₹20,000
No plant ₹0 ₹0 ₹0
• FoxPhone India Co. assumes that the least favorable state of nature
occurs for each decision action
• Select the minimum reward for each decision
– All three minimums occur if an unfavorable economy
prevails (a tie in case of no plant)
• Select the maximum of the minimums
– Maximum is ₹0; corresponding decision is to do
nothing
– A conservative decision; largest possible gain, ₹0, is
much less than maximax
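The same sketch adapted to maximin (the `rewards` dictionary repeats the hypothetical layout used above); the only change is taking each row's worst case before maximizing:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}

# Maximin: worst-case reward of each action, then the action whose worst case is largest.
row_min = {a: min(r) for a, r in rewards.items()}
print(row_min)                        # {'large plant': -180000, 'small plant': -20000, 'no plant': 0}
print(max(row_min, key=row_min.get))  # 'no plant'
```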
Equal Likelihood Criterion
• Assumes that all states of nature are equally likely to
occur
– Maximax criterion assumed the most favorable state of
nature occurs for each decision
– Maximin criterion assumed the least favorable state of
nature occurs for each decision

• Calculate the average reward for each action and select the action
with the maximum average
– Average reward: the sum of all rewards divided by the number of
states of nature

• Select the decision that gives the highest average reward
Equal Likelihood
States of Nature
Decision Favorable Unfavorable Row Average

Large plant ₹200,000 -₹180,000 ₹10,000
Small plant ₹100,000 -₹20,000 ₹40,000
No plant ₹0 ₹0 ₹0

Row Averages
Large plant: (₹200,000 - ₹180,000) / 2 = ₹10,000
Small plant: (₹100,000 - ₹20,000) / 2 = ₹40,000
Do nothing: (₹0 + ₹0) / 2 = ₹0

• Select the decision with the highest average value
– Maximum is ₹40,000; corresponding decision is to build the small plant
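A corresponding sketch for the equal likelihood (Laplace) criterion, again using the illustrative `rewards` dictionary:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}

# Equal likelihood: average each row over the states of nature, pick the best average.
row_avg = {a: sum(r) / len(r) for a, r in rewards.items()}
print(row_avg)                        # {'large plant': 10000.0, 'small plant': 40000.0, 'no plant': 0.0}
print(max(row_avg, key=row_avg.get))  # 'small plant'
```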
Criterion of Realism
• Also known as the weighted average or Hurwicz criterion
– A compromise between an optimistic and pessimistic decision

• A coefficient of realism, α, is selected by the decision maker to
indicate optimism or pessimism about the future
0 < α < 1
When α is close to 1, the decision maker is optimistic.
When α is close to 0, the decision maker is pessimistic.

• Criterion of realism = α(row maximum) + (1 - α)(row minimum)
– A weighted average where the maximum and minimum rewards are
weighted by α and (1 - α) respectively
Criterion of Realism
• Assume a coefficient of realism equal to 0.8
States of Nature
Decision Favorable Unfavorable Criterion of Realism

Large plant ₹200,000 -₹180,000 ₹124,000


Small plant ₹100,000 -₹20,000 ₹76,000
No plant ₹0 ₹0 ₹0
Weighted Averages
Large plant = (0.8)(₹200,000) + (0.2)(-₹180,000) = ₹124,000
Small plant = (0.8)(₹100,000) + (0.2)(-₹20,000) = ₹76,000
Do nothing = (0.8)(₹0) + (0.2)(₹0) = ₹0

Select the decision with the highest weighted value
Maximum is ₹124,000; corresponding decision is to build the large plant
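A sketch of the Hurwicz computation with α = 0.8, matching the weighted averages above (same illustrative table):

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}

# Criterion of realism: alpha-weighted blend of each row's best and worst reward.
alpha = 0.8
realism = {a: alpha * max(r) + (1 - alpha) * min(r) for a, r in rewards.items()}
print(realism)                        # {'large plant': 124000.0, 'small plant': 76000.0, 'no plant': 0.0}
print(max(realism, key=realism.get))  # 'large plant'
```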
Minimax Regret
• Regret/Opportunity Loss: “the difference
between the optimal reward and the actual
reward received”

• Choose the action that minimizes the maximum regret associated with
each action
– Start by determining the maximum regret for each action
– Pick the action with the minimum of these maximum regrets


Regret Table
• If I knew the future, how much I’d regret my
decision…

• Regret for any state of nature is calculated by subtracting each
outcome in the column from the best outcome in the same column
Minimax Regret
States of Nature: Favorable, Unfavorable
Decision Payoff Regret Payoff Regret Row Maximum

Large plant ₹200,000 ₹0 -₹180,000 ₹180,000 ₹180,000
Small plant ₹100,000 ₹100,000 -₹20,000 ₹20,000 ₹100,000
No plant ₹0 ₹200,000 ₹0 ₹0 ₹200,000
Best payoff ₹200,000 ₹0

• Select the action with the lowest maximum regret
Minimum is ₹100,000; corresponding decision is to build a small plant
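A sketch of minimax regret on the same illustrative table: compute each cell's regret against its column's best payoff, then minimize the row maxima:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}

# Regret = (best payoff in the column) - (payoff actually received).
n_states = 2  # 0 = favorable, 1 = unfavorable
best_per_state = [max(r[s] for r in rewards.values()) for s in range(n_states)]
max_regret = {a: max(best_per_state[s] - r[s] for s in range(n_states))
              for a, r in rewards.items()}
print(max_regret)                           # {'large plant': 180000, 'small plant': 100000, 'no plant': 200000}
print(min(max_regret, key=max_regret.get))  # 'small plant'
```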
Summary of Results
Criterion Decision
Maximax Build a large plant
Maximin Do nothing
Equal likelihood Build a small plant
Realism Build a large plant
Minimax regret Build a small plant
Future Uncertainty
• Non-deterministic

• Probabilistic
Probabilistic Uncertainty
• Decision makers know the probability of
occurrence for each possible outcome
– Attempt to maximize the expected reward

• Criteria for decision models in this environment:
– Maximization of expected reward
– Minimization of expected regret
• Minimizing expected regret = maximizing expected reward!
Expected Reward (Q)
• called Expected Monetary Value (EMV) in DT literature
• “the probability weighted sum of possible rewards for
each action”
– Requires a reward table with conditional rewards and
probability assessments for all states of nature

Q(action a) = (reward of 1st state of nature) x (probability of 1st state of nature)
            + (reward of 2nd state of nature) x (probability of 2nd state of nature)
            + ... + (reward of last state of nature) x (probability of last state of nature)
The FoxPhone India Co.
• Suppose that the probability of a favorable market is exactly the same as
the probability of an unfavorable market. Which action would give the
greatest Q?

States of Nature
Favorable Mkt Unfavorable Mkt
Decision p = 0.5 p = 0.5 EMV

Large plant ₹200,000 -₹180,000 ₹10,000
Small plant ₹100,000 -₹20,000 ₹40,000
No plant ₹0 ₹0 ₹0

Q(large plant) = (0.5)(₹200,000) + (0.5)(-₹180,000) = ₹10,000
Q(small plant) = (0.5)(₹100,000) + (0.5)(-₹20,000) = ₹40,000
Q(no plant) = (0.5)(₹0) + (0.5)(₹0) = ₹0

Build the small plant
(Decision-tree view: each action branches on the 0.5/0.5 market outcomes (large plant: 200K / -180K; small plant: 100K / -20K; no plant: 0 / 0), giving EMVs of 10K, 40K and 0; the best branch, the small plant, is worth 40K.)
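A sketch of the EMV computation with P(favorable) = P(unfavorable) = 0.5, on the same illustrative table:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}
probs = (0.5, 0.5)  # P(favorable), P(unfavorable)

# Expected reward / EMV: probability-weighted sum of each action's rewards.
emv = {a: sum(p * x for p, x in zip(probs, r)) for a, r in rewards.items()}
print(emv)                    # {'large plant': 10000.0, 'small plant': 40000.0, 'no plant': 0.0}
print(max(emv, key=emv.get))  # 'small plant'
```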


Expected Value of Perfect Information
(EVPI)
• It may be possible to purchase additional
information about future events and thus make a
better decision

– FoxPhone India Co. could hire an economist to analyze the economy in
order to more accurately determine which economic condition will
occur in the future
• How valuable would this information be?

(Decision-tree view: an "ask" node branching on favorable vs. unfavorable markets.)
EVPI Computation
• Look first at the decisions under each state of
nature

– If information were available that perfectly predicted which state
of nature was going to occur, the best decision for that state of
nature could be made

• Expected value with perfect information (EV w/ PI): "the expected or
average return if we have perfect information before a decision has
to be made"
EVPI Computation
• Perfect information changes environment from
decision making under risk to decision making
with certainty
– Build the large plant if you know for sure that a
favorable market will prevail
– Do nothing if you know for sure that an unfavorable
market will prevail
States of Nature
Favorable Unfavorable
Decision p = 0.5 p = 0.5

Large plant ₹200,000 -₹180,000


Small plant ₹100,000 -₹20,000
No plant ₹0 ₹0
(Decision-tree view: after asking, the favorable branch's best leaf is 200K (large plant) and the unfavorable branch's best leaf is 0 (no plant).)

EVPI Computation
• Even though perfect information enables FoxPhone
India Co. to make the correct investment decision,
each state of nature occurs only a certain portion
of the time
– A favorable market occurs 50% of the time and an
unfavorable market occurs 50% of the time

– EV w/ PI is calculated by choosing the best action for each state of
nature and multiplying its reward by the probability of occurrence
of that state of nature
EVPI Computation
EV w/ PI = (best reward for 1st state of nature) x (probability of 1st state of nature)
         + (best reward for 2nd state of nature) x (probability of 2nd state of nature)

EV w/ PI = (₹200,000)(0.5) + (₹0)(0.5) = ₹100,000

States of Nature
Favorable Unfavorable
Decision p = 0.5 p = 0.5

Large plant ₹200,000 -₹180,000


Small plant ₹100,000 -₹20,000
No plant ₹0 ₹0
(Decision-tree view: the "ask" node is now worth 100K = 0.5 x 200K + 0.5 x 0.)


EVPI Computation
• FoxPhone India Co. would be foolish to pay more
for this information than the extra profit that would
be gained from having it
– EVPI: “the maximum amount a decision maker would
pay for additional information resulting in a
decision better than one made without perfect
information ”
• EVPI is the expected outcome with perfect information minus the
expected outcome without perfect information

EVPI = EV w/ PI - Q
EVPI = ₹100,000 - ₹40,000 = ₹60,000
(Decision-tree view: deciding directly is worth 40K, while asking is worth 100K minus the cost c of the information, so the root is worth max(40K, 100K - c); asking pays off as long as c < 60K.)


Using EVPI
• EVPI of ₹60,000 is the maximum amount that
FoxPhone India Co. should pay to purchase
perfect information from a source such as an
economist
– “Perfect” information is extremely rare

– An investor typically would be willing to pay some amount less than
₹60,000, depending on how reliable the information is perceived to be
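A sketch of the EVPI arithmetic, reusing the illustrative table and probabilities from the EMV sketch:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}
probs = (0.5, 0.5)

# Best EMV without extra information.
emv = {a: sum(p * x for p, x in zip(probs, r)) for a, r in rewards.items()}
best_emv = max(emv.values())                                   # 40000.0 (small plant)

# EV with perfect information: always pick the best action for the state that occurs.
ev_with_pi = sum(p * max(r[s] for r in rewards.values()) for s, p in enumerate(probs))
print(ev_with_pi)                                              # 100000.0 = 0.5*200000 + 0.5*0
print(ev_with_pi - best_emv)                                   # EVPI = 60000.0
```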
Is Expected Value sufficient?
• Lottery 1
– returns ₹0 always
• Lottery 2
– returns +₹100 or -₹100, each with prob 0.5

• Which is better?
Is Expected Value sufficient?
• Lottery 1
– returns ₹100 always
• Lottery 2
– returns ₹10,000 with prob 0.01 and ₹0 with prob 0.99

• Which is better?
– depends
Is Expected Value sufficient?
• Lottery 1
– returns ₹3125 always
• Lottery 2
– returns ₹4,000 with prob 0.75 and -₹500 with prob 0.25

• Which is better?
Is Expected Value sufficient?
• Lottery 1
– returns ₹0 always
• Lottery 2
– returns ₹1,000,000 or -₹1,000,000, each with prob 0.5

• Which is better?
Utility Theory
• Adds a layer of utility over rewards
• Risk averse
– |Utility| of losing a large amount of money is much GREATER than
the utility of gaining the same amount
• Risk prone
– Reverse

• Use expected utility criteria…


Utility function of a risk-averse agent
(Figure: utility vs. money; a concave curve, flattening as money increases)

Utility function of a risk-prone agent
(Figure: utility vs. money; a convex curve, steepening as money increases)

Utility function of a risk-neutral agent
(Figure: utility vs. money; a straight line)
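A toy illustration of how a concave (risk-averse) utility can reject a fair gamble; the square-root utility, the ₹10,000 starting wealth, and the ±₹6,000 lottery are all hypothetical, chosen only to make the point:

```python
import math

def utility(wealth):          # hypothetical concave utility: diminishing returns
    return math.sqrt(wealth)

wealth = 10_000
eu_sure   = utility(wealth)   # keep ₹10,000 for sure
eu_gamble = 0.5 * utility(wealth + 6_000) + 0.5 * utility(wealth - 6_000)

print(eu_sure)     # 100.0
print(eu_gamble)   # ~94.87: the risk-averse agent prefers the sure thing,
                   # even though both options have the same expected wealth
```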
Markov Decision Processes
Planning Agent
(Diagram: a planning agent asks "what action next?" in an environment that may be static vs. dynamic, fully vs. partially observable, deterministic vs. stochastic, with perfect vs. noisy percepts and instantaneous vs. durative actions.)
Search Algorithms
(Diagram: classical search assumes a static, fully observable, deterministic environment with perfect percepts and instantaneous actions.)
Stochastic Planning: MDPs
(Diagram: MDPs keep the static, fully observable, instantaneous, perfect-percept assumptions but allow stochastic action outcomes.)
MDP vs. Decision Theory

• Decision theory: episodic
• MDP: sequential
Markov Decision Process (MDP)

• S: a set of states (possibly factored, as in a Factored MDP)
• A: a set of actions
• T(s,a,s'): transition model
• C(s,a,s'): cost model
• G: a set of goals (absorbing or non-absorbing)
• s0: start state
• γ: discount factor
• R(s,a,s'): reward model
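A minimal container mirroring this tuple, as an illustrative sketch (the field names follow the slide's notation; this is not a specific library's API):

```python
from dataclasses import dataclass, field
from typing import Callable, Hashable, Optional, Set

State = Hashable
Action = Hashable

@dataclass
class MDP:
    S: Set[State]                                # states
    A: Callable[[State], Set[Action]]            # applicable actions in a state
    T: Callable[[State, Action, State], float]   # transition model T(s, a, s')
    C: Callable[[State, Action, State], float]   # cost model C(s, a, s') (or a reward model R)
    G: Set[State] = field(default_factory=set)   # goal (often absorbing) states
    s0: Optional[State] = None                   # start state
    gamma: float = 1.0                           # discount factor
```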
Objective of an MDP

• Find a policy π: S → A

• which optimizes
• minimizes expected cost to reach a goal, or
• maximizes discounted or undiscounted expected reward, or
• maximizes expected (reward - cost)

• given a horizon
• finite
• infinite
• indefinite
• assuming full observability
Role of Discount Factor (γ)

• Keeps the total reward/total cost finite
• useful for infinite-horizon problems

• Intuition (economics):
• Money today is worth more than money tomorrow.

• Total reward: r1 + γr2 + γ²r3 + …
• Total cost: c1 + γc2 + γ²c3 + …
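A one-function sketch of the discounted total reward above (the reward sequence and γ = 0.9 are just example values):

```python
def discounted_return(rewards, gamma):
    # r1 + gamma*r2 + gamma^2*r3 + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```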
Examples of MDPs

• Goal-directed, Indefinite Horizon, Cost Minimization MDP
• <S, A, T, C, G, s0>
• Most often studied in the planning and graph theory communities

• Infinite Horizon, Discounted Reward Maximization MDP
• <S, A, T, R, γ>, the most popular formulation
• Most often studied in the machine learning, economics, and
operations research communities

• Oversubscription Planning: Non-absorbing goals, Reward Maximization MDP
• <S, A, T, G, R, s0>
• Relatively recent model
Acyclic vs. Cyclic MDPs
(Two diagrams: an acyclic MDP over states P, Q, R, S, T, G and a cyclic variant in which action a from P can loop back to P. Costs: C(a) = 5, C(b) = 10, C(c) = 1.)

Acyclic MDP: expectimin works
• V(Q/R/S/T) = 1
• V(P) = 6, via action a

Cyclic MDP: expectimin doesn't work (infinite loop)
• V(R/S/T) = 1
• Q(P,b) = 11
• Q(P,a) = ????
• suppose I decide to take a in P:
• Q(P,a) = 5 + 0.4*1 + 0.6*Q(P,a)  =>  Q(P,a) = 13.5
Brute force Algorithm

 Go over all policies π
• How many? |A|^|S| (finite)

 Evaluate each policy
• Vπ(s) ← expected cost of reaching the goal from s (how to evaluate?)

 Choose the best
• We know that the best exists (SSP optimality principle)
• Vπ*(s) ≤ Vπ(s)
Policy Evaluation

 Given a policy π: compute Vπ
• Vπ: cost of reaching the goal while following π
Deterministic MDPs

 Policy graph for π: π(s0) = a0; π(s1) = a1

s0 --a0 (C=5)--> s1 --a1 (C=1)--> sg

 Vπ(s1) = 1
 Vπ(s0) = 6 (add costs on the path to the goal)
Acyclic MDPs

 Policy graph for π

(Diagram: from s0, a0 reaches s1 with Pr=0.6 at cost 5 or s2 with Pr=0.4 at cost 2; from s1, a1 reaches sg at cost 1; from s2, a2 reaches sg at cost 4.)

 Vπ(s1) = 1
 Vπ(s2) = 4
 Vπ(s0) = 0.6(5+1) + 0.4(2+4) = 6

(backward pass in reverse topological order)
General MDPs can be cyclic!

(Diagram: same structure, except that a2 from s2 now reaches sg with Pr=0.7 at cost 4 and loops back to s0 with Pr=0.3 at cost 3.)

 Vπ(s1) = 1
 Vπ(s2) = ?? (depends on Vπ(s0))
 Vπ(s0) = ?? (depends on Vπ(s2))

cannot do a simple single pass
General SSPs can be cyclic!

(Diagram: the same cyclic SSP as above.)

 Vπ(sg) = 0
 Vπ(s1) = 1 + Vπ(sg) = 1
 Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))
 Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))

a simple system of linear equations
Policy Evaluation (Approach 1)

 Solve the system of linear equations:

Vπ(s) = 0                                                        if s ∈ G
Vπ(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ(s')]        otherwise

 |S| variables
 O(|S|³) running time
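A sketch of Approach 1 on the cyclic SSP above: with Vπ(sg) = 0 and Vπ(s1) = 1 substituted in by hand, only two unknowns remain, and NumPy's linear solver gives them directly:

```python
import numpy as np

# V(s0) - 0.4*V(s2) = 4.4   (from V(s0) = 0.6*(5 + 1) + 0.4*(2 + V(s2)))
# V(s2) - 0.3*V(s0) = 3.7   (from V(s2) = 0.7*(4 + 0) + 0.3*(3 + V(s0)))
A = np.array([[1.0, -0.4],
              [-0.3, 1.0]])
b = np.array([4.4, 3.7])
v_s0, v_s2 = np.linalg.solve(A, b)
print(v_s0, v_s2)               # ~6.6818  ~5.7045
```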
Iterative Policy Evaluation

(Diagram: the same cyclic SSP, annotated with the iterates.)

Updates, with the known values substituted in:
Vπ(s0) ← 4.4 + 0.4 Vπ(s2)
Vπ(s2) ← 3.7 + 0.3 Vπ(s0)

Starting from 0, the iterates are:
Vπ(s0): 0, 5.88, 6.5856, 6.670272, 6.68043…
Vπ(s2): 3.7, 5.464, 5.67568, 5.7010816, 5.704129…
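A short sketch reproducing the sweeps above: starting from zero and updating the two states in place (s2, then s0), the iterates match the numbers on the slide and approach the linear-system solution:

```python
v_s0, v_s2 = 0.0, 0.0
for sweep in range(6):
    v_s2 = 3.7 + 0.3 * v_s0     # V(s2) <- 0.7*(4 + 0) + 0.3*(3 + V(s0))
    v_s0 = 4.4 + 0.4 * v_s2     # V(s0) <- 0.6*(5 + 1) + 0.4*(2 + V(s2))
    print(sweep, round(v_s0, 6), round(v_s2, 6))
# v_s0: 5.88, 6.5856, 6.670272, 6.680433, ...   -> ~6.6818
# v_s2: 3.7,  5.464,  5.67568,  5.7010816, ...  -> ~5.7045
```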
Policy Evaluation (Approach 2)

Vπ(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ(s')]

iterative refinement:

Vπ_n(s) ← Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ_{n-1}(s')]
Iterative Policy Evaluation

(Algorithm box: repeat the fixed-policy backup for iterations n = 1, 2, … until an ε-consistency termination condition holds, i.e., the change between successive iterates is below ε.)

Convergence & Optimality
For a proper policy π, iterative policy evaluation converges to the
true value of the policy, i.e.,

lim_{n→∞} Vπ_n = Vπ

irrespective of the initialization V0
Policy Evaluation → Value Iteration
(Bellman Equations for MDP1)

• <S, A, T, C, G, s0>

• Define V*(s) {optimal cost} as the minimum expected cost to reach a
goal from this state.
• V* should satisfy the following equations:

V*(s) = 0                                                        if s ∈ G
V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]    otherwise

The quantity inside the min is Q*(s,a), so V*(s) = min_a Q*(s,a)
Bellman Equations for MDP2

• <S, A, T, R, s0, γ>

• Define V*(s) {optimal value} as the maximum expected discounted
reward from this state.
• V* should satisfy the following equation:

V*(s) = max_{a∈A} Σ_{s'∈S} T(s, a, s') [R(s, a, s') + γ V*(s')]
Fixed Point Computation in VI

V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]

iterative refinement:

V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V_{n-1}(s')]

(non-linear, because of the min over actions)
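A generic sketch of the cost-minimizing backup as code. The MDP is represented as plain dictionaries, and the tiny two-action example at the bottom is hypothetical (it is not the example on the next slide); the point is just the min over actions inside the backup:

```python
# transitions[s][a] = list of (probability, next_state, cost) triples
transitions = {
    "s0": {
        "safe":  [(1.0, "g", 3.0)],                    # cost 3, reaches the goal for sure
        "risky": [(0.5, "g", 1.0), (0.5, "s0", 1.0)],  # cost 1, succeeds only half the time
    },
    "g": {},                                           # absorbing goal
}
goal = {"g"}

def value_iteration(transitions, goal, eps=1e-6):
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            if s in goal:
                continue
            # Bellman backup: V(s) <- min_a sum_s' T(s,a,s') [C(s,a,s') + V(s')]
            q = {a: sum(p * (c + V[s2]) for p, s2, c in outcomes)
                 for a, outcomes in transitions[s].items()}
            new_v = min(q.values())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            return V

print(value_iteration(transitions, goal))   # V(s0) ~ 2.0: 'risky' is the better action
```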
Example

(Diagram: a six-state example with states s0…s4 and sg; from s4, action a40 reaches sg at cost 5, while a41 costs 2 and reaches sg with Pr=0.6 or s3 with Pr=0.4.)
Bellman Backup
(Diagram: backing up state s4, with V0(sg) = 0 and V0(s3) = 2.)

Q1(s4, a40) = 5 + 0 = 5
Q1(s4, a41) = 2 + 0.6×0 + 0.4×2 = 2.8

V1(s4) = min(5, 2.8) = 2.8, with greedy action a_greedy = a41
Value Iteration [Bellman 57]

No restriction on the initial value function

(Algorithm box: repeat the Bellman backup for iterations n = 1, 2, … until the ε-consistency termination condition holds.)
Example
(all actions cost 1 unless otherwise stated)
(Diagram: the same six-state example as before.)

n    Vn(s0)    Vn(s1)    Vn(s2)    Vn(s3)    Vn(s4)
0    3         3         2         2         1
1    3         3         2         2         2.8
2    3         3         3.8       3.8       2.8
3    4         4.8       3.8       3.8       3.52
4    4.8       4.8       4.52      4.52      3.52
5    5.52      5.52      4.52      4.52      3.808
20   5.99921   5.99921   4.99969   4.99969   3.99969
Comments

• Decision-theoretic algorithm
• Dynamic programming
• Fixed point computation
• Probabilistic version of the Bellman-Ford algorithm
• for shortest path computation
• MDP1: Stochastic Shortest Path Problem

 Time complexity
• one iteration: O(|S|²|A|)
• number of iterations: poly(|S|, |A|, 1/ε, 1/(1-γ))
 Space complexity: O(|S|)
Changing the Search Space

• Value Iteration
• Search in value space
• Compute the resulting policy

• Policy Iteration
• Search in policy space
• Compute the resulting value
Policy iteration [Howard’60]

• assign an arbitrary policy π0 to each state
• repeat
• Policy Evaluation: compute Vn+1, the evaluation of πn
(costly: O(n³); Modified Policy Iteration approximates this step
by value iteration with the policy held fixed)
• Policy Improvement: for all states s
• compute πn+1(s) = argmin_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 = πn

Advantage
• searching in a finite (policy) space, as opposed to an uncountably
infinite (value) space ⇒ convergence in a smaller number of iterations
• all other properties
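A compact sketch of policy iteration in the same dict-based format as the value-iteration sketch earlier; the evaluation step here is approximate (a fixed number of sweeps), so it is really the modified variant described next. The tiny MDP is again hypothetical:

```python
transitions = {
    "s0": {"safe":  [(1.0, "g", 3.0)],
           "risky": [(0.5, "g", 1.0), (0.5, "s0", 1.0)]},
    "g": {},
}
goal = {"g"}

def evaluate(policy, sweeps=100):
    V = {s: 0.0 for s in transitions}
    for _ in range(sweeps):                       # approximate V^pi by repeated sweeps
        for s in policy:
            V[s] = sum(p * (c + V[s2]) for p, s2, c in transitions[s][policy[s]])
    return V

def policy_iteration():
    policy = {s: next(iter(a)) for s, a in transitions.items() if s not in goal}
    while True:
        V = evaluate(policy)                      # policy evaluation
        improved = {s: min(transitions[s],        # policy improvement: argmin_a Q(s,a)
                           key=lambda a: sum(p * (c + V[s2]) for p, s2, c in transitions[s][a]))
                    for s in policy}
        if improved == policy:
            return policy, V
        policy = improved

print(policy_iteration())   # ({'s0': 'risky'}, {'s0': ~2.0, 'g': 0.0})
```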
Modified Policy iteration

• assign an arbitrary policy π0 to each state

• repeat
• Policy Evaluation: compute Vn+1, the approximate evaluation of πn
• Policy Improvement: for all states s
• compute πn+1(s) = argmin_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 ≈ πn

Advantage
• probably the most competitive synchronous dynamic programming
algorithm
VI  Asynchronous VI

 Is backing up all states in an iteration


essential?
• No!

 States may be backed up


• as many times
• in any order

 If no state gets starved


• convergence properties still hold!!
43
Applications
 Stochastic Games
 Robotics: navigation, helicopter maneuvers…
 Finance: options, investments
 Communication Networks
 Medicine: Radiation planning for cancer
 Controlling workflows
 Optimize bidding decisions in auctions
 Traffic flow optimization
 Aircraft queueing for landing; airline meal
provisioning
 Optimizing software on mobiles
 Forest firefighting
 … 44
Extensions
 Heuristic Search + Dynamic Programming
• AO*, LAO*, RTDP, …

 Hierarchical MDPs
• hierarchy of sub-tasks, actions to scale
better
 Reinforcement Learning
• learning the probability and rewards
• acting while learning – connections to
psychology
 Partially Observable Markov Decision
Processes
• noisy sensors; partially observable
environment
• popular in robotics
