Basics RL

Agents & Environments

Outline
• Agents and environments
• Rationality
• PEAS specification
• Environment types
• Agent types

Agents
• An agent is anything that can be viewed as perceiving
its environment through sensors and acting upon
that environment through actuators

• Human agent:
– eyes, ears, and other organs for sensors
– hands, legs, mouth, and other body parts for actuators

• Robotic agent:
– cameras and laser range finders for sensors
– various motors for actuators

Examples of Agents
• Physically grounded agents
– Intelligent buildings
– Autonomous spacecraft

• Softbots
– Expert Systems
– IBM Watson
Intelligent Agents
• Have sensors, effectors
• Implement a mapping from percept sequences to actions
• Performance measure

(Diagram: percepts flow from the environment to the agent; actions flow from the agent back to the environment.)
Rational Agents
• An agent should strive to do the right thing, based on what it can
perceive and the actions it can perform. The right action is the one
that will cause the agent to be most successful

• Performance measure: an objective criterion for success of an
agent's behavior

• E.g., the performance measure of a vacuum-cleaner agent could be the
amount of dirt cleaned up, amount of time taken, amount of electricity
consumed, amount of noise generated, etc.
Ideal Rational Agent
“For each possible percept sequence, does
whatever action is expected to maximize its
performance measure on the basis of evidence
perceived so far and built-in knowledge.''

• Rationality vs omniscience?
• Acting in order to obtain valuable information
Bounded Rationality
• We have a performance measure to optimize
• Given our state of knowledge
• Choose optimal action
• Given limited computational resources
PEAS: Specifying Task Environments
• PEAS: Performance measure, Environment, Actuators,
Sensors

• Must first specify the setting for intelligent agent design

• Example: the task of designing an automated taxi driver:
– Performance measure
– Environment
– Actuators
– Sensors
PEAS
• Agent: Automated taxi driver

• Performance measure:
– Safe, fast, legal, comfortable trip, maximize profits
• Environment:
– Roads, other traffic, pedestrians, customers
• Actuators:
– Steering wheel, accelerator, brake, signal, horn

• Sensors:
– Cameras, sonar, speedometer, GPS, odometer, engine sensors,
keyboard

PEAS
• Agent: Medical diagnosis system
• Performance measure:
– Healthy patient, minimize costs, lawsuits

• Environment:
– Patient, hospital, staff

• Actuators:
– Screen display (questions, tests, diagnoses, treatments,
referrals)
• Sensors:
– (entry of symptoms, findings, patient's answers)

Properties of Environments
• Observability: full vs. partial vs. non

• Deterministic vs. stochastic

• Episodic vs. sequential

• Static vs. Semi-dynamic vs. dynamic

• Discrete vs. continuous

• Single Agent vs. Multi Agent (Cooperative, Competitive, Self-Interested)


RoboCup vs. Chess

Deep Blue (Chess) vs. RoboCup Robot

• Static/Semi-dynamic vs. Dynamic
• Deterministic vs. Stochastic
• Observable vs. Partially observable
• Discrete vs. Continuous
• Sequential vs. Sequential
• Multi-Agent vs. Multi-Agent
More Examples
• 15-Puzzle
– Static – Deterministic – Fully Obs – Discrete – Seq – Single

• Poker
– Static – Stochastic – Partially Obs – Discrete – Seq – Multi-agent

• Medical Diagnosis
– Dynamic – Stochastic – Partially Obs – Continuous – Seq – Single

• Taxi Driving
– Dynamic – Stochastic – Partially Obs – Continuous – Seq – Multi-agent
Simple reflex agents
(Diagram: sensors report "what the world is like now"; condition/action rules pick "what action should I do now"; effectors act on the environment.)
Reflex agent with internal state
(Diagram: the agent also keeps internal state: "what the world was like" plus a model of "how the world evolves" are combined with the current percept to estimate "what the world is like now" before the condition/action rules choose an action.)
Goal-based agents
(Diagram: the agent additionally models "what my actions do", predicts "what it'll be like if I do actions A1-An", and uses its goals to decide "what action should I do now".)
Utility-based agents
(Diagram: as in the goal-based agent, the agent predicts "what it'll be like if I do actions A1-An", but a utility function now answers "how happy would I be?", and the action with the best expected utility is chosen.)
Learning agents
(Diagram: same architecture, but feedback from the environment is used to learn how the world evolves, learn what the agent's actions do, and learn the utility function itself.)
Fundamentals of Decision Theory
Uncertainty in AI

• Uncertainty comes in many forms


– uncertainty due to another agent’s policy
– uncertainty in outcome of my own action
– uncertainty in my knowledge of the world
– uncertainty in how the world evolves
Decision Theory
• is the study of an agent's choices
• lays out principles for how an agent arrives at an optimal choice
– Good decisions may occasionally have unexpected bad outcomes;
they are still good decisions if made properly
– Bad decisions may occasionally have good outcomes if you are lucky;
they are still bad decisions
Steps in Decision Theory
1. List the possible actions (actions/decisions)

2. Identify the possible outcomes

3. List the payoff or profit or reward

4. Select one of the decision theory models

5. Apply the model and make your decision


Running Example
• Problem.
– The FoxPhone India Co. must decide whether or not to
expand its product line by manufacturing
indigenous smartphones in India

• Step 1: List the possible alternatives/actions

alternative: "a course of action or strategy that may be chosen by
the decision maker"
– (1) Construct a large plant to manufacture the
phones
– (2) Construct a small plant
– (3) Do nothing
The FoxPhone India Co.
• Step 2: Identify the states of nature
– Market for indigenous smartphones could be favorable
• high demand
– Market for indigenous smartphones could be unfavorable
• low demand

state of nature: "an outcome over which the agent has little or no
control"
e.g., lottery, coin-toss, whether it will rain today
The FoxPhone India Co.
• Step 3: List the possible rewards
– A reward for all possible combinations of actions and
states of nature
– Conditional values: “reward depends upon the
action and the state of nature”
• with a favorable market:
– a large plant produces a net profit of ₹200,000
– a small plant produces a net profit of ₹100,000
– no plant produces a net profit of ₹0
• with an unfavorable market:
– a large plant produces a net loss of ₹180,000
– a small plant produces a net loss of ₹20,000
– no plant produces a net profit of ₹0
Reward Tables

• A means of organizing a decision situation, including the rewards
from different situations given the possible states of nature
(rows: actions; columns: states of nature)
The FoxPhone India Co.

States of Nature
Actions Favorable Market Unfavorable Market
Large plant ₹200,000 -₹180,000
Small plant ₹100,000 -₹20,000
No plant ₹0 ₹0
The FoxPhone India Co.
• Steps 4/5: Select an appropriate model and apply it
– Model selection depends on the operating environment and degree
of uncertainty
Future Uncertainty
• Nondeterministic

• Probabilistic
Non-deterministic Uncertainty

States of Nature
Actions Favorable Market Unfavorable Market
Large plant ₹200,000 -₹180,000
Small plant ₹100,000 -₹20,000
No plant ₹0 ₹0

• What should we do?


Maximax Criterion
“Go for the Gold”

• Select the decision that results in the maximum of the maximum
rewards

• A very optimistic decision criterion
– Decision maker assumes that the most favorable state of nature
for each action will occur

• Most risk-prone agent


Maximax
States of Nature
Decision Favorable Unfavorable Maximum in Row

Large plant ₹200,000 -₹180,000 ₹200,000
Small plant ₹100,000 -₹20,000 ₹100,000
No plant ₹0 ₹0 ₹0

• FoxPhone India Co. assumes that the most favorable state of nature
occurs for each decision action
• Select the maximum reward for each decision
– All three maximums occur if a favorable economy prevails (a tie in
case of no plant)
• Select the maximum of the maximums
– Maximum is ₹200,000; corresponding decision is to build the large
plant
– Potential loss of ₹180,000 is completely ignored
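A minimal Python sketch of the maximax rule on this reward table; the dictionary layout and variable names below are illustrative, not from the slides:

```python
# Reward table: action -> (favorable-market reward, unfavorable-market reward), in ₹
rewards = {
    "large plant": (200_000, -180_000),
    "small plant": (100_000, -20_000),
    "no plant":    (0, 0),
}

# Maximax: best-case reward of each action, then the action whose best case is largest.
row_max = {a: max(r) for a, r in rewards.items()}
print(row_max)                        # {'large plant': 200000, 'small plant': 100000, 'no plant': 0}
print(max(row_max, key=row_max.get))  # 'large plant'
```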
Maximin Criterion
“Best of the Worst”

• Select the decision that results in the maximum of the minimum
rewards

• A very pessimistic decision criterion
– Decision maker assumes that the minimum reward occurs for each
decision action
– Select the maximum of these minimum rewards

• Most risk-averse agent


Maximin
States of Nature
Decision Favorable Unfavorable Minimum in Row

Large plant ₹200,000 -₹180,000 -₹180,000


Small plant ₹100,000 -₹20,000 -₹20,000
No plant ₹0 ₹0 ₹0
• FoxPhone India Co. assumes that the least favorable state of nature
occurs for each decision action
• Select the minimum reward for each decision
– All three minimums occur if an unfavorable economy
prevails (a tie in case of no plant)
• Select the maximum of the minimums
– Maximum is ₹0; corresponding decision is to do
nothing
– A conservative decision; largest possible gain, ₹0, is
much less than maximax
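The same sketch adapted to maximin (the `rewards` dictionary repeats the hypothetical layout used above); the only change is taking each row's worst case before maximizing:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}

# Maximin: worst-case reward of each action, then the action whose worst case is largest.
row_min = {a: min(r) for a, r in rewards.items()}
print(row_min)                        # {'large plant': -180000, 'small plant': -20000, 'no plant': 0}
print(max(row_min, key=row_min.get))  # 'no plant'
```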
Equal Likelihood Criterion
• Assumes that all states of nature are equally likely to
occur
– Maximax criterion assumed the most favorable state of
nature occurs for each decision
– Maximin criterion assumed the least favorable state of
nature occurs for each decision

• Calculate the average reward for each action and select the action
with the maximum average
– Average reward: the sum of all rewards divided by the number of
states of nature

• Select the decision that gives the highest average reward
Equal Likelihood
States of Nature
Decision Favorable Unfavorable Row Average

Large plant ₹200,000 -₹180,000 ₹10,000
Small plant ₹100,000 -₹20,000 ₹40,000
No plant ₹0 ₹0 ₹0

Row Averages
Large plant: (₹200,000 - ₹180,000) / 2 = ₹10,000
Small plant: (₹100,000 - ₹20,000) / 2 = ₹40,000
Do nothing: (₹0 + ₹0) / 2 = ₹0

• Select the decision with the highest average value
– Maximum is ₹40,000; corresponding decision is to build the small plant
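A corresponding sketch for the equal likelihood (Laplace) criterion, again using the illustrative `rewards` dictionary:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}

# Equal likelihood: average each row over the states of nature, pick the best average.
row_avg = {a: sum(r) / len(r) for a, r in rewards.items()}
print(row_avg)                        # {'large plant': 10000.0, 'small plant': 40000.0, 'no plant': 0.0}
print(max(row_avg, key=row_avg.get))  # 'small plant'
```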
Criterion of Realism
• Also known as the weighted average or Hurwicz criterion
– A compromise between an optimistic and pessimistic decision

• A coefficient of realism, α, is selected by the decision maker to
indicate optimism or pessimism about the future
0 < α < 1
When α is close to 1, the decision maker is optimistic.
When α is close to 0, the decision maker is pessimistic.

• Criterion of realism = α(row maximum) + (1 - α)(row minimum)
– A weighted average where the maximum and minimum rewards are
weighted by α and (1 - α) respectively
Criterion of Realism
• Assume a coefficient of realism equal to 0.8
States of Nature
Decision Favorable Unfavorable Criterion of Realism

Large plant ₹200,000 -₹180,000 ₹124,000


Small plant ₹100,000 -₹20,000 ₹76,000
No plant ₹0 ₹0 ₹0
Weighted Averages
Large plant = (0.8)(₹200,000) + (0.2)(-₹180,000) = ₹124,000
Small plant = (0.8)(₹100,000) + (0.2)(-₹20,000) = ₹76,000
Do nothing = (0.8)(₹0) + (0.2)(₹0) = ₹0

Select the decision with the highest weighted value
Maximum is ₹124,000; corresponding decision is to build the large plant
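A sketch of the Hurwicz computation with α = 0.8, matching the weighted averages above (same illustrative table):

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}

# Criterion of realism: alpha-weighted blend of each row's best and worst reward.
alpha = 0.8
realism = {a: alpha * max(r) + (1 - alpha) * min(r) for a, r in rewards.items()}
print(realism)                        # {'large plant': 124000.0, 'small plant': 76000.0, 'no plant': 0.0}
print(max(realism, key=realism.get))  # 'large plant'
```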
Minimax Regret
• Regret/Opportunity Loss: “the difference
between the optimal reward and the actual
reward received”

• Choose the action that minimizes the maximum regret associated with
each action
– Start by determining the maximum regret for each action
– Pick the action with the minimum of these maximum regrets


Regret Table
• If I knew the future, how much I’d regret my
decision…

• Regret for any state of nature is calculated by subtracting each
outcome in the column from the best outcome in the same column
Minimax Regret
States of Nature: Favorable, Unfavorable
Decision Payoff Regret Payoff Regret Row Maximum

Large plant ₹200,000 ₹0 -₹180,000 ₹180,000 ₹180,000
Small plant ₹100,000 ₹100,000 -₹20,000 ₹20,000 ₹100,000
No plant ₹0 ₹200,000 ₹0 ₹0 ₹200,000
Best payoff ₹200,000 ₹0

• Select the action with the lowest maximum regret
Minimum is ₹100,000; corresponding decision is to build a small plant
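A sketch of minimax regret on the same illustrative table: compute each cell's regret against its column's best payoff, then minimize the row maxima:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}

# Regret = (best payoff in the column) - (payoff actually received).
n_states = 2  # 0 = favorable, 1 = unfavorable
best_per_state = [max(r[s] for r in rewards.values()) for s in range(n_states)]
max_regret = {a: max(best_per_state[s] - r[s] for s in range(n_states))
              for a, r in rewards.items()}
print(max_regret)                           # {'large plant': 180000, 'small plant': 100000, 'no plant': 200000}
print(min(max_regret, key=max_regret.get))  # 'small plant'
```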
Summary of Results
Criterion Decision
Maximax Build a large plant
Maximin Do nothing
Equal likelihood Build a small plant
Realism Build a large plant
Minimax regret Build a small plant
Future Uncertainty
• Non-deterministic

• Probabilistic
Probabilistic Uncertainty
• Decision makers know the probability of
occurrence for each possible outcome
– Attempt to maximize the expected reward

• Criteria for decision models in this environment:
– Maximization of expected reward
– Minimization of expected regret
• Minimizing expected regret = maximizing expected reward!
Expected Reward (Q)
• called Expected Monetary Value (EMV) in DT literature
• “the probability weighted sum of possible rewards for
each action”
– Requires a reward table with conditional rewards and
probability assessments for all states of nature

Q(action a) = (reward of 1st state of nature) x (probability of 1st state of nature)
            + (reward of 2nd state of nature) x (probability of 2nd state of nature)
            + ... + (reward of last state of nature) x (probability of last state of nature)
The FoxPhone India Co.
• Suppose that the probability of a favorable market is exactly the same as
the probability of an unfavorable market. Which action would give the
greatest Q?

States of Nature
Favorable Mkt Unfavorable Mkt
Decision p = 0.5 p = 0.5 EMV

Large plant ₹200,000 -₹180,000 ₹10,000
Small plant ₹100,000 -₹20,000 ₹40,000
No plant ₹0 ₹0 ₹0

Q(large plant) = (0.5)(₹200,000) + (0.5)(-₹180,000) = ₹10,000
Q(small plant) = (0.5)(₹100,000) + (0.5)(-₹20,000) = ₹40,000
Q(no plant) = (0.5)(₹0) + (0.5)(₹0) = ₹0

Build the small plant
(Decision-tree view: each action branches on the 0.5/0.5 market outcomes (large plant: 200K / -180K; small plant: 100K / -20K; no plant: 0 / 0), giving EMVs of 10K, 40K and 0; the best branch, the small plant, is worth 40K.)
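A sketch of the EMV computation with P(favorable) = P(unfavorable) = 0.5, on the same illustrative table:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}
probs = (0.5, 0.5)  # P(favorable), P(unfavorable)

# Expected reward / EMV: probability-weighted sum of each action's rewards.
emv = {a: sum(p * x for p, x in zip(probs, r)) for a, r in rewards.items()}
print(emv)                    # {'large plant': 10000.0, 'small plant': 40000.0, 'no plant': 0.0}
print(max(emv, key=emv.get))  # 'small plant'
```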


Expected Value of Perfect Information
(EVPI)
• It may be possible to purchase additional
information about future events and thus make a
better decision

– FoxPhone India Co. could hire an economist to analyze the economy in
order to more accurately determine which economic condition will
occur in the future
• How valuable would this information be?

(Decision-tree view: an "ask" node branching on favorable vs. unfavorable markets.)
EVPI Computation
• Look first at the decisions under each state of
nature

– If information were available that perfectly predicted which state
of nature was going to occur, the best decision for that state of
nature could be made

• Expected value with perfect information (EV w/ PI): "the expected or
average return if we have perfect information before a decision has
to be made"
EVPI Computation
• Perfect information changes environment from
decision making under risk to decision making
with certainty
– Build the large plant if you know for sure that a
favorable market will prevail
– Do nothing if you know for sure that an unfavorable
market will prevail
States of Nature
Favorable Unfavorable
Decision p = 0.5 p = 0.5

Large plant ₹200,000 -₹180,000


Small plant ₹100,000 -₹20,000
No plant ₹0 ₹0
(Decision-tree view: after asking, the favorable branch's best leaf is 200K (large plant) and the unfavorable branch's best leaf is 0 (no plant).)

EVPI Computation
• Even though perfect information enables FoxPhone
India Co. to make the correct investment decision,
each state of nature occurs only a certain portion
of the time
– A favorable market occurs 50% of the time and an
unfavorable market occurs 50% of the time

– EV w/ PI is calculated by choosing the best action for each state of
nature and multiplying its reward by the probability of occurrence
of that state of nature
EVPI Computation
EV w/ PI = (best reward for 1st state of nature) x (probability of 1st state of nature)
         + (best reward for 2nd state of nature) x (probability of 2nd state of nature)

EV w/ PI = (₹200,000)(0.5) + (₹0)(0.5) = ₹100,000

States of Nature
Favorable Unfavorable
Decision p = 0.5 p = 0.5

Large plant ₹200,000 -₹180,000


Small plant ₹100,000 -₹20,000
No plant ₹0 ₹0
(Decision-tree view: the "ask" node is now worth 100K = 0.5 x 200K + 0.5 x 0.)


EVPI Computation
• FoxPhone India Co. would be foolish to pay more
for this information than the extra profit that would
be gained from having it
– EVPI: “the maximum amount a decision maker would
pay for additional information resulting in a
decision better than one made without perfect
information ”
• EVPI is the expected outcome with perfect information minus the
expected outcome without perfect information

EVPI = EV w/ PI - Q
EVPI = ₹100,000 - ₹40,000 = ₹60,000
(Decision-tree view: deciding directly is worth 40K, while asking is worth 100K minus the cost c of the information, so the root is worth max(40K, 100K - c); asking pays off as long as c < 60K.)


Using EVPI
• EVPI of ₹60,000 is the maximum amount that
FoxPhone India Co. should pay to purchase
perfect information from a source such as an
economist
– “Perfect” information is extremely rare

– An investor typically would be willing to pay some amount less than
₹60,000, depending on how reliable the information is perceived to be
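A sketch of the EVPI arithmetic, reusing the illustrative table and probabilities from the EMV sketch:

```python
rewards = {"large plant": (200_000, -180_000), "small plant": (100_000, -20_000), "no plant": (0, 0)}
probs = (0.5, 0.5)

# Best EMV without extra information.
emv = {a: sum(p * x for p, x in zip(probs, r)) for a, r in rewards.items()}
best_emv = max(emv.values())                                   # 40000.0 (small plant)

# EV with perfect information: always pick the best action for the state that occurs.
ev_with_pi = sum(p * max(r[s] for r in rewards.values()) for s, p in enumerate(probs))
print(ev_with_pi)                                              # 100000.0 = 0.5*200000 + 0.5*0
print(ev_with_pi - best_emv)                                   # EVPI = 60000.0
```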
Is Expected Value sufficient?
• Lottery 1
– returns ₹0 always
• Lottery 2
– returns +₹100 or -₹100, each with prob 0.5

• Which is better?
Is Expected Value sufficient?
• Lottery 1
– returns ₹100 always
• Lottery 2
– returns ₹10,000 with prob 0.01 and ₹0 with prob 0.99

• Which is better?
– depends
Is Expected Value sufficient?
• Lottery 1
– returns ₹3125 always
• Lottery 2
– returns ₹4,000 with prob 0.75 and -₹500 with prob 0.25

• Which is better?
Is Expected Value sufficient?
• Lottery 1
– returns ₹0 always
• Lottery 2
– returns ₹1,000,000 or -₹1,000,000, each with prob 0.5

• Which is better?
Utility Theory
• Adds a layer of utility over rewards
• Risk averse
– |Utility| of losing a large amount of money is much GREATER than
the utility of gaining the same amount
• Risk prone
– Reverse

• Use expected utility criteria…


Utility function of a risk-averse agent
(Figure: utility vs. money; a concave curve, flattening as money increases)

Utility function of a risk-prone agent
(Figure: utility vs. money; a convex curve, steepening as money increases)

Utility function of a risk-neutral agent
(Figure: utility vs. money; a straight line)
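A toy illustration of how a concave (risk-averse) utility can reject a fair gamble; the square-root utility, the ₹10,000 starting wealth, and the ±₹6,000 lottery are all hypothetical, chosen only to make the point:

```python
import math

def utility(wealth):          # hypothetical concave utility: diminishing returns
    return math.sqrt(wealth)

wealth = 10_000
eu_sure   = utility(wealth)   # keep ₹10,000 for sure
eu_gamble = 0.5 * utility(wealth + 6_000) + 0.5 * utility(wealth - 6_000)

print(eu_sure)     # 100.0
print(eu_gamble)   # ~94.87: the risk-averse agent prefers the sure thing,
                   # even though both options have the same expected wealth
```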
Markov Decision Processes
Planning Agent
(Diagram: a planning agent asks "what action next?" in an environment that may be static vs. dynamic, fully vs. partially observable, deterministic vs. stochastic, with perfect vs. noisy percepts and instantaneous vs. durative actions.)
Search Algorithms
(Diagram: classical search assumes a static, fully observable, deterministic environment with perfect percepts and instantaneous actions.)
Stochastic Planning: MDPs
(Diagram: MDPs keep the static, fully observable, instantaneous, perfect-percept assumptions but allow stochastic action outcomes.)
MDP vs. Decision Theory

• Decision theory: episodic
• MDP: sequential
Markov Decision Process (MDP)

• S: a set of states (possibly factored, as in a Factored MDP)
• A: a set of actions
• T(s,a,s'): transition model
• C(s,a,s'): cost model
• G: a set of goals (absorbing or non-absorbing)
• s0: start state
• γ: discount factor
• R(s,a,s'): reward model
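A minimal container mirroring this tuple, as an illustrative sketch (the field names follow the slide's notation; this is not a specific library's API):

```python
from dataclasses import dataclass, field
from typing import Callable, Hashable, Optional, Set

State = Hashable
Action = Hashable

@dataclass
class MDP:
    S: Set[State]                                # states
    A: Callable[[State], Set[Action]]            # applicable actions in a state
    T: Callable[[State, Action, State], float]   # transition model T(s, a, s')
    C: Callable[[State, Action, State], float]   # cost model C(s, a, s') (or a reward model R)
    G: Set[State] = field(default_factory=set)   # goal (often absorbing) states
    s0: Optional[State] = None                   # start state
    gamma: float = 1.0                           # discount factor
```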
Objective of an MDP

• Find a policy π: S → A

• which optimizes
• minimizes expected cost to reach a goal, or
• maximizes discounted or undiscounted expected reward, or
• maximizes expected (reward - cost)

• given a horizon
• finite
• infinite
• indefinite
• assuming full observability
Role of Discount Factor (γ)

• Keeps the total reward/total cost finite
• useful for infinite-horizon problems

• Intuition (economics):
• Money today is worth more than money tomorrow.

• Total reward: r1 + γr2 + γ²r3 + …
• Total cost: c1 + γc2 + γ²c3 + …
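A one-function sketch of the discounted total reward above (the reward sequence and γ = 0.9 are just example values):

```python
def discounted_return(rewards, gamma):
    # r1 + gamma*r2 + gamma^2*r3 + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```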
Examples of MDPs

• Goal-directed, Indefinite Horizon, Cost Minimization MDP
• <S, A, T, C, G, s0>
• Most often studied in the planning and graph theory communities

• Infinite Horizon, Discounted Reward Maximization MDP
• <S, A, T, R, γ>, the most popular formulation
• Most often studied in the machine learning, economics, and
operations research communities

• Oversubscription Planning: Non-absorbing goals, Reward Maximization MDP
• <S, A, T, G, R, s0>
• Relatively recent model
Acyclic vs. Cyclic MDPs
(Two diagrams: an acyclic MDP over states P, Q, R, S, T, G and a cyclic variant in which action a from P can loop back to P. Costs: C(a) = 5, C(b) = 10, C(c) = 1.)

Acyclic MDP: expectimin works
• V(Q/R/S/T) = 1
• V(P) = 6, via action a

Cyclic MDP: expectimin doesn't work (infinite loop)
• V(R/S/T) = 1
• Q(P,b) = 11
• Q(P,a) = ????
• suppose I decide to take a in P:
• Q(P,a) = 5 + 0.4*1 + 0.6*Q(P,a)  =>  Q(P,a) = 13.5
Brute force Algorithm

 Go over all policies π
• How many? |A|^|S| (finite)

 Evaluate each policy
• Vπ(s) ← expected cost of reaching the goal from s (how to evaluate?)

 Choose the best
• We know that the best exists (SSP optimality principle)
• Vπ*(s) ≤ Vπ(s)
Policy Evaluation

 Given a policy π: compute Vπ
• Vπ: cost of reaching the goal while following π
Deterministic MDPs

 Policy graph for π: π(s0) = a0; π(s1) = a1

s0 --a0 (C=5)--> s1 --a1 (C=1)--> sg

 Vπ(s1) = 1
 Vπ(s0) = 6 (add costs on the path to the goal)
Acyclic MDPs

 Policy graph for π

(Diagram: from s0, a0 reaches s1 with Pr=0.6 at cost 5 or s2 with Pr=0.4 at cost 2; from s1, a1 reaches sg at cost 1; from s2, a2 reaches sg at cost 4.)

 Vπ(s1) = 1
 Vπ(s2) = 4
 Vπ(s0) = 0.6(5+1) + 0.4(2+4) = 6

(backward pass in reverse topological order)
General MDPs can be cyclic!

(Diagram: same structure, except that a2 from s2 now reaches sg with Pr=0.7 at cost 4 and loops back to s0 with Pr=0.3 at cost 3.)

 Vπ(s1) = 1
 Vπ(s2) = ?? (depends on Vπ(s0))
 Vπ(s0) = ?? (depends on Vπ(s2))

cannot do a simple single pass
General SSPs can be cyclic!

(Diagram: the same cyclic SSP as above.)

 Vπ(sg) = 0
 Vπ(s1) = 1 + Vπ(sg) = 1
 Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))
 Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))

a simple system of linear equations
Policy Evaluation (Approach 1)

 Solve the system of linear equations:

Vπ(s) = 0                                                        if s ∈ G
Vπ(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ(s')]        otherwise

 |S| variables
 O(|S|³) running time
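A sketch of Approach 1 on the cyclic SSP above: with Vπ(sg) = 0 and Vπ(s1) = 1 substituted in by hand, only two unknowns remain, and NumPy's linear solver gives them directly:

```python
import numpy as np

# V(s0) - 0.4*V(s2) = 4.4   (from V(s0) = 0.6*(5 + 1) + 0.4*(2 + V(s2)))
# V(s2) - 0.3*V(s0) = 3.7   (from V(s2) = 0.7*(4 + 0) + 0.3*(3 + V(s0)))
A = np.array([[1.0, -0.4],
              [-0.3, 1.0]])
b = np.array([4.4, 3.7])
v_s0, v_s2 = np.linalg.solve(A, b)
print(v_s0, v_s2)               # ~6.6818  ~5.7045
```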
Iterative Policy Evaluation

(Diagram: the same cyclic SSP, annotated with the iterates.)

Updates, with the known values substituted in:
Vπ(s0) ← 4.4 + 0.4 Vπ(s2)
Vπ(s2) ← 3.7 + 0.3 Vπ(s0)

Starting from 0, the iterates are:
Vπ(s0): 0, 5.88, 6.5856, 6.670272, 6.68043…
Vπ(s2): 3.7, 5.464, 5.67568, 5.7010816, 5.704129…
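A short sketch reproducing the sweeps above: starting from zero and updating the two states in place (s2, then s0), the iterates match the numbers on the slide and approach the linear-system solution:

```python
v_s0, v_s2 = 0.0, 0.0
for sweep in range(6):
    v_s2 = 3.7 + 0.3 * v_s0     # V(s2) <- 0.7*(4 + 0) + 0.3*(3 + V(s0))
    v_s0 = 4.4 + 0.4 * v_s2     # V(s0) <- 0.6*(5 + 1) + 0.4*(2 + V(s2))
    print(sweep, round(v_s0, 6), round(v_s2, 6))
# v_s0: 5.88, 6.5856, 6.670272, 6.680433, ...   -> ~6.6818
# v_s2: 3.7,  5.464,  5.67568,  5.7010816, ...  -> ~5.7045
```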
Policy Evaluation (Approach 2)

Vπ(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ(s')]

iterative refinement:

Vπ_n(s) ← Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ_{n-1}(s')]
Iterative Policy Evaluation

(Algorithm box: repeat the fixed-policy backup for iterations n = 1, 2, … until an ε-consistency termination condition holds, i.e., the change between successive iterates is below ε.)

Convergence & Optimality
For a proper policy π, iterative policy evaluation converges to the
true value of the policy, i.e.,

lim_{n→∞} Vπ_n = Vπ

irrespective of the initialization V0
Policy Evaluation → Value Iteration
(Bellman Equations for MDP1)

• <S, A, T, C, G, s0>

• Define V*(s) {optimal cost} as the minimum expected cost to reach a
goal from this state.
• V* should satisfy the following equations:

V*(s) = 0                                                        if s ∈ G
V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]    otherwise

The quantity inside the min is Q*(s,a), so V*(s) = min_a Q*(s,a)
Bellman Equations for MDP2

• <S, A, T, R, s0, γ>

• Define V*(s) {optimal value} as the maximum expected discounted
reward from this state.
• V* should satisfy the following equation:

V*(s) = max_{a∈A} Σ_{s'∈S} T(s, a, s') [R(s, a, s') + γ V*(s')]
Fixed Point Computation in VI

V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]

iterative refinement:

V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V_{n-1}(s')]

(non-linear, because of the min over actions)
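A generic sketch of the cost-minimizing backup as code. The MDP is represented as plain dictionaries, and the tiny two-action example at the bottom is hypothetical (it is not the example on the next slide); the point is just the min over actions inside the backup:

```python
# transitions[s][a] = list of (probability, next_state, cost) triples
transitions = {
    "s0": {
        "safe":  [(1.0, "g", 3.0)],                    # cost 3, reaches the goal for sure
        "risky": [(0.5, "g", 1.0), (0.5, "s0", 1.0)],  # cost 1, succeeds only half the time
    },
    "g": {},                                           # absorbing goal
}
goal = {"g"}

def value_iteration(transitions, goal, eps=1e-6):
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            if s in goal:
                continue
            # Bellman backup: V(s) <- min_a sum_s' T(s,a,s') [C(s,a,s') + V(s')]
            q = {a: sum(p * (c + V[s2]) for p, s2, c in outcomes)
                 for a, outcomes in transitions[s].items()}
            new_v = min(q.values())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            return V

print(value_iteration(transitions, goal))   # V(s0) ~ 2.0: 'risky' is the better action
```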
Example

(Diagram: a six-state example with states s0…s4 and sg; from s4, action a40 reaches sg at cost 5, while a41 costs 2 and reaches sg with Pr=0.6 or s3 with Pr=0.4.)
Bellman Backup
(Diagram: backing up state s4, with V0(sg) = 0 and V0(s3) = 2.)

Q1(s4, a40) = 5 + 0 = 5
Q1(s4, a41) = 2 + 0.6×0 + 0.4×2 = 2.8

V1(s4) = min(5, 2.8) = 2.8, with greedy action a_greedy = a41
Value Iteration [Bellman 57]

No restriction on the initial value function

(Algorithm box: repeat the Bellman backup for iterations n = 1, 2, … until the ε-consistency termination condition holds.)
Example
(all actions cost 1 unless otherwise stated)
(Diagram: the same six-state example as before.)

n    Vn(s0)    Vn(s1)    Vn(s2)    Vn(s3)    Vn(s4)
0    3         3         2         2         1
1    3         3         2         2         2.8
2    3         3         3.8       3.8       2.8
3    4         4.8       3.8       3.8       3.52
4    4.8       4.8       4.52      4.52      3.52
5    5.52      5.52      4.52      4.52      3.808
20   5.99921   5.99921   4.99969   4.99969   3.99969
Comments

• Decision-theoretic algorithm
• Dynamic programming
• Fixed point computation
• Probabilistic version of the Bellman-Ford algorithm
• for shortest path computation
• MDP1: Stochastic Shortest Path Problem

 Time complexity
• one iteration: O(|S|²|A|)
• number of iterations: poly(|S|, |A|, 1/ε, 1/(1-γ))
 Space complexity: O(|S|)
Changing the Search Space

• Value Iteration
• Search in value space
• Compute the resulting policy

• Policy Iteration
• Search in policy space
• Compute the resulting value
Policy iteration [Howard’60]

• assign an arbitrary policy π0 to each state
• repeat
• Policy Evaluation: compute Vn+1, the evaluation of πn
(costly: O(n³); Modified Policy Iteration approximates this step
by value iteration with the policy held fixed)
• Policy Improvement: for all states s
• compute πn+1(s) = argmin_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 = πn

Advantage
• searching in a finite (policy) space, as opposed to an uncountably
infinite (value) space ⇒ convergence in a smaller number of iterations
• all other properties
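A compact sketch of policy iteration in the same dict-based format as the value-iteration sketch earlier; the evaluation step here is approximate (a fixed number of sweeps), so it is really the modified variant described next. The tiny MDP is again hypothetical:

```python
transitions = {
    "s0": {"safe":  [(1.0, "g", 3.0)],
           "risky": [(0.5, "g", 1.0), (0.5, "s0", 1.0)]},
    "g": {},
}
goal = {"g"}

def evaluate(policy, sweeps=100):
    V = {s: 0.0 for s in transitions}
    for _ in range(sweeps):                       # approximate V^pi by repeated sweeps
        for s in policy:
            V[s] = sum(p * (c + V[s2]) for p, s2, c in transitions[s][policy[s]])
    return V

def policy_iteration():
    policy = {s: next(iter(a)) for s, a in transitions.items() if s not in goal}
    while True:
        V = evaluate(policy)                      # policy evaluation
        improved = {s: min(transitions[s],        # policy improvement: argmin_a Q(s,a)
                           key=lambda a: sum(p * (c + V[s2]) for p, s2, c in transitions[s][a]))
                    for s in policy}
        if improved == policy:
            return policy, V
        policy = improved

print(policy_iteration())   # ({'s0': 'risky'}, {'s0': ~2.0, 'g': 0.0})
```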
Modified Policy iteration

• assign an arbitrary policy π0 to each state

• repeat
• Policy Evaluation: compute Vn+1, the approximate evaluation of πn
• Policy Improvement: for all states s
• compute πn+1(s) = argmin_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 ≈ πn

Advantage
• probably the most competitive synchronous dynamic programming
algorithm
VI  Asynchronous VI

 Is backing up all states in an iteration


essential?
• No!

 States may be backed up


• as many times
• in any order

 If no state gets starved


• convergence properties still hold!!
43
Applications
 Stochastic Games
 Robotics: navigation, helicopter maneuvers…
 Finance: options, investments
 Communication Networks
 Medicine: Radiation planning for cancer
 Controlling workflows
 Optimize bidding decisions in auctions
 Traffic flow optimization
 Aircraft queueing for landing; airline meal
provisioning
 Optimizing software on mobiles
 Forest firefighting
 … 44
Extensions
 Heuristic Search + Dynamic Programming
• AO*, LAO*, RTDP, …

 Hierarchical MDPs
• hierarchy of sub-tasks, actions to scale
better
 Reinforcement Learning
• learning the probability and rewards
• acting while learning – connections to
psychology
 Partially Observable Markov Decision
Processes
• noisy sensors; partially observable
environment
• popular in robotics
