Basics RL
Outline
• Agents and environments
• Rationality
• PEAS specification
• Environment types
• Agent types
Agents
• An agent is anything that can be viewed as perceiving
its environment through sensors and acting upon
that environment through actuators
• Human agent:
– eyes, ears, and other organs for sensors
– hands, legs, mouth, and other body parts for actuators
• Robotic agent:
– cameras and laser range finders for sensors
– various motors for actuators
Examples of Agents
• Physically grounded agents
– Intelligent buildings
– Autonomous spacecraft
• Softbots
– Expert Systems
– IBM Watson
Intelligent Agents
• Have sensors, effectors
• Implement mapping from percept sequence to actions
• Performance measure
[Diagram: the agent receives percepts from the environment through its sensors and acts on the environment through its effectors]
Rational Agents
• An agent should strive to do the right thing, based on what
it can perceive and the actions it can perform. The right
action is the one that will cause the agent to be most
successful
Ideal Rational Agent
"For each possible percept sequence, does whatever action is expected to maximize its performance measure on the basis of evidence perceived so far and built-in knowledge."
• Rationality vs omniscience?
• Acting in order to obtain valuable information
Bounded Rationality
• We have a performance measure to optimize
• Given our state of knowledge
• Choose optimal action
• Given limited computational resources
PEAS: Specifying Task Environments
• PEAS: Performance measure, Environment, Actuators, Sensors
• Agent: Automated taxi driver
• Performance measure:
– Safe, fast, legal, comfortable trip, maximize profits
• Environment:
– Roads, other traffic, pedestrians, customers
• Actuators:
– Steering wheel, accelerator, brake, signal, horn
• Sensors:
– Cameras, sonar, speedometer, GPS, odometer, engine sensors,
keyboard
PEAS
• Agent: Medical diagnosis system
• Performance measure:
– Healthy patient, minimize costs, lawsuits
• Environment:
– Patient, hospital, staff
• Actuators:
– Screen display (questions, tests, diagnoses, treatments,
referrals)
• Sensors:
– (entry of symptoms, findings, patient's answers)
Properties of Environments
• Fully observable vs. partially observable vs. non-observable
• Static vs. semi-dynamic vs. dynamic
• Deterministic vs. stochastic
• Discrete vs. continuous
• Episodic vs. sequential
• Single-agent vs. multi-agent
More Examples
• 15-Puzzle
– Static – Deterministic – Fully Obs – Discrete – Seq – Single
• Poker
– Static – Stochastic – Partially Obs – Discrete – Seq – Multi-agent
• Medical Diagnosis
– Dynamic – Stochastic – Partially Obs – Continuous – Seq – Single
• Taxi Driving
– Dynamic – Stochastic – Partially Obs – Continuous – Seq – Multi-agent
Simple reflex agents
[Diagram: sensors → "what the world is like now" → condition/action rules → "what action should I do now" → effectors, all coupled to the environment]
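For concreteness, here is a minimal sketch of a condition/action-rule agent in Python; the vacuum-world percepts and rules are hypothetical illustrations, not taken from the slides.

```python
# Minimal sketch of a simple reflex agent: condition/action rules map the
# current percept directly to an action, with no internal state.

def simple_reflex_agent(percept):
    """percept is a (location, status) pair, e.g. ('A', 'dirty')."""
    location, status = percept
    # Condition/action rules (hypothetical vacuum-cleaner world)
    if status == "dirty":
        return "suck"
    elif location == "A":
        return "move_right"
    else:
        return "move_left"

print(simple_reflex_agent(("A", "dirty")))   # -> suck
print(simple_reflex_agent(("B", "clean")))   # -> move_left
```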
Reflex agent with internal state
[Diagram: as above, but with internal state — "what the world was like" and a model of "how the world evolves" feed the estimate of "what the world is like now" before the condition/action rules pick an action]
Goal-based agents
[Diagram: adds a model of "what my actions do" to predict "what it'll be like if I do actions A1–An", and goals to decide "what action should I do now"]
Utility-based agents
[Diagram: like the goal-based agent, but a utility function ("how happy would I be?") scores the predicted outcomes before choosing the action]
Learning agents
[Diagram: the same architecture, with feedback used to learn how the world evolves, what my actions do, and the utility function]
Fundamentals of Decision Theory
Uncertainty in AI
                   States of Nature
Actions         Favorable Market    Unfavorable Market
Large plant     ₹200,000            -₹180,000
Small plant     ₹100,000            -₹20,000
No plant        ₹0                  ₹0
The FoxPhone India Co.
• Steps 4/5: Select an appropriate model and apply it
• Probabilistic or non-deterministic uncertainty
Non-deterministic Uncertainty
                   States of Nature
Actions         Favorable Market    Unfavorable Market
Large plant     ₹200,000            -₹180,000
Small plant     ₹100,000            -₹20,000
No plant        ₹0                  ₹0
• Weighted value, with weight 0.8 on the favorable and 0.2 on the unfavorable outcome:
  – Large plant: (0.8)(₹200,000) + (0.2)(-₹180,000) = ₹124,000
  – Small plant: (0.8)(₹100,000) + (0.2)(-₹20,000) = ₹76,000
  – No plant: (0.8)(₹0) + (0.2)(₹0) = ₹0
• Select the decision with the highest weighted value
• Maximum is ₹124,000; the corresponding decision is to build the large plant
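As a sketch, the same selection can be computed directly from the payoff table above (weights 0.8/0.2 as on the slide; variable names are mine):

```python
# Pick the decision with the highest weighted value for the payoff table above.
payoffs = {              # (favorable, unfavorable) payoffs in rupees
    "large plant": (200_000, -180_000),
    "small plant": (100_000, -20_000),
    "no plant":    (0, 0),
}
w_fav, w_unfav = 0.8, 0.2   # weights used on the slide

weighted = {d: w_fav * fav + w_unfav * unfav for d, (fav, unfav) in payoffs.items()}
best = max(weighted, key=weighted.get)
print(weighted)   # {'large plant': 124000.0, 'small plant': 76000.0, 'no plant': 0.0}
print(best)       # large plant
```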
Minimax Regret
• Regret/Opportunity Loss: “the difference
between the optimal reward and the actual
reward received”
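A short sketch of this idea on the same payoff table: build the opportunity-loss table and pick the decision whose worst-case regret is smallest (the table and names are reused from above; the resulting choice is computed by the code, not stated on the slide):

```python
# Build the regret (opportunity-loss) table and pick the minimax-regret decision.
payoffs = {
    "large plant": (200_000, -180_000),
    "small plant": (100_000, -20_000),
    "no plant":    (0, 0),
}
# Best payoff achievable under each state of nature
best_fav   = max(fav for fav, _ in payoffs.values())     # 200,000
best_unfav = max(unfav for _, unfav in payoffs.values()) # 0

regret = {d: (best_fav - fav, best_unfav - unfav) for d, (fav, unfav) in payoffs.items()}
max_regret = {d: max(r) for d, r in regret.items()}
print(max_regret)   # {'large plant': 180000, 'small plant': 100000, 'no plant': 200000}
print(min(max_regret, key=max_regret.get))   # small plant
```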
Probabilistic Uncertainty
• Decision makers know the probability of
occurrence for each possible outcome
– Attempt to maximize the expected reward
                      States of Nature
Decision        Favorable Mkt (p = 0.5)   Unfavorable Mkt (p = 0.5)   EMV
Large plant     ₹200,000                  -₹180,000                   ₹10,000
Small plant     ₹100,000                  -₹20,000                    ₹40,000
No plant        ₹0                        ₹0                          ₹0
Best decision: build the small plant (EMV = ₹40,000)
EVPI Computation
• Look first at the best decision under each state of nature:
  – Favorable market: large plant, payoff ₹200,000
  – Unfavorable market: no plant, payoff ₹0
• EV with perfect information = (₹200,000)(0.5) + (₹0)(0.5) = ₹100,000
• EVPI = EV with PI - best EMV without the information
• EVPI = ₹100,000 - ₹40,000 = ₹60,000
• If perfect information costs c, the choice is max(₹40,000, ₹100,000 - c): which is better?
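The same EVPI arithmetic as a sketch (probabilities 0.5/0.5 and the payoff table from above):

```python
# EVPI = expected value with perfect information - best EMV without information.
payoffs = {
    "large plant": (200_000, -180_000),
    "small plant": (100_000, -20_000),
    "no plant":    (0, 0),
}
p_fav, p_unfav = 0.5, 0.5

emv = {d: p_fav * fav + p_unfav * unfav for d, (fav, unfav) in payoffs.items()}
best_emv = max(emv.values())   # 40,000 (small plant)

# With perfect information we pick the best decision for each state of nature.
ev_with_pi = p_fav * max(fav for fav, _ in payoffs.values()) \
           + p_unfav * max(unfav for _, unfav in payoffs.values())   # 100,000

print(ev_with_pi - best_emv)   # 60000.0 = EVPI
```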
Is Expected Value sufficient?
• Lottery 1
– returns ₹100 always
• Lottery 2
– returns ₹10,000 with prob 0.01 and ₹0 with prob 0.99
• Which is better?
– depends
Is Expected Value sufficient?
• Lottery 1
– returns ₹3125 always
• Lottery 2
– returns ₹4,000 with prob 0.75 and -₹500 with prob 0.25
• Which is better?
Is Expected Value sufficient?
• Lottery 1
– returns ₹0 always
• Lottery 2
– returns ₹1,000,000 with prob 0.5 and -₹1,000,000 with prob 0.5
• Which is better?
Utility Theory
• Adds a layer of utility over rewards
• Risk averse
  – the magnitude of the (negative) utility of losing a large amount of money is much greater than the utility of gaining the same amount
• Risk prone
  – the reverse
[Plot: utility function of a risk-averse agent — utility vs. money, concave]
Utility function of a risk-prone agent
[Plot: utility vs. money, convex]
Utility function of a risk-neutral agent
[Plot: utility vs. money, linear]
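A sketch contrasting expected value with expected utility for the first lottery pair above; the square-root utility is an illustrative risk-averse choice, not taken from the slides:

```python
import math

# Risk-averse agents maximize expected *utility*, not expected money.
# Illustrative concave utility (assumption): u(x) = sqrt(x) for x >= 0.
def u(x):
    return math.sqrt(x)

lottery1 = [(1.0, 100)]                 # returns 100 always
lottery2 = [(0.01, 10_000), (0.99, 0)]  # 10,000 with prob 0.01, else 0

def expected_value(lottery):
    return sum(p * x for p, x in lottery)

def expected_utility(lottery):
    return sum(p * u(x) for p, x in lottery)

print(expected_value(lottery1), expected_value(lottery2))     # 100.0 100.0 (equal)
print(expected_utility(lottery1), expected_utility(lottery2)) # 10.0 1.0 -> prefers lottery 1
```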
Markov Decision Processes
Planning Agent
[Diagram: a planning agent observes percepts, acts, and asks "What action next?"]
• Static vs. dynamic environment
• Fully vs. partially observable
• Deterministic vs. stochastic
• Perfect vs. noisy percepts
• Instantaneous vs. durative actions
Search Algorithms
• Static environment
• Fully observable
• Deterministic
• Instantaneous actions
• Perfect percepts
• What action next?
Stochastic Planning: MDPs
• Static environment
• Fully observable
• Stochastic
• Instantaneous actions
• Perfect percepts
• What action next?
MDP vs. Decision Theory
• Decision theory – episodic
• MDP – sequential
Markov Decision Process (MDP)
• S: set of states
• A: set of actions
• T(s, a, s'): transition model
• γ: discount factor
• R(s, a, s'): reward model
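A minimal sketch of this tuple as plain Python data; the two-state example values are made up for illustration:

```python
# An MDP as plain data: states S, actions A, transition model T, reward model R,
# and discount factor gamma. T[s][a] is a list of (next_state, probability);
# R[(s, a, s2)] is the reward for that transition. Example values are made up.
S = ["s0", "s1"]
A = ["stay", "go"]
T = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {("s0", "go", "s1"): 10.0}   # all other transitions give reward 0
gamma = 0.9

def reward(s, a, s2):
    return R.get((s, a, s2), 0.0)
```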
Objective of an MDP
• Find a policy π: S → A
• which optimizes
  – minimizes expected cost to reach a goal, or
  – maximizes discounted expected reward, or
  – maximizes undiscounted expected (reward - cost)
• given a horizon
  – finite
  – infinite
  – indefinite
• assuming full observability
Role of Discount Factor (γ)
• Intuition (economics):
  – Money today is worth more than money tomorrow.
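Written out (a standard formulation, using the R and γ defined above), the discounted objective is

\[
V^{\pi}(s) \;=\; \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, \pi(s_t), s_{t+1}) \,\Big|\, s_0 = s\Big], \qquad 0 \le \gamma < 1,
\]

so a reward received t steps in the future contributes only γ^t times as much as the same reward received now.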
Examples of MDPs
Acyclic vs. Cyclic MDPs
[Diagram: two MDPs rooted at P. Left (acyclic): action a leads to Q (0.6) or R (0.4), action b to S (0.5) or T (0.5), and each of these states reaches the goal G via action c. Right (cyclic): action a from P reaches R with probability 0.4 and returns to P with probability 0.6.]
• C(a) = 5, C(b) = 10, C(c) = 1
• Acyclic case: expectimin works
  – V(Q/R/S/T) = 1
  – V(P) = 6 → take action a
• Cyclic case: expectimin doesn't work (infinite loop)
  – V(R/S/T) = 1
  – Q(P,b) = 11
  – Q(P,a) = ????
  – suppose I decide to take a in P:
    Q(P,a) = 5 + 0.4·1 + 0.6·Q(P,a)  ⇒  Q(P,a) = 13.5
Brute force Algorithm
• Go over all policies π
  – How many? |A|^|S| — finite
• Evaluate each policy
  – Vπ(s) ← expected cost of reaching the goal from s
  – how to evaluate?
Deterministic MDPs
[Diagram: s0 --a0 (C=5)--> s1 --a1 (C=1)--> sg]
• Vπ(s1) = 1
• Vπ(s0) = 6
• add costs along the path to the goal
Acyclic MDPs
[Diagram: from s0, action a0 leads to s1 (Pr=0.6, C=5) or s2 (Pr=0.4, C=2); from s1, a1 leads to sg (C=1); from s2, a2 leads to sg (Pr=0.7, C=4) or another state (Pr=0.3, C=3)]
• cannot do a simple single pass
• Vπ(s1) = 1
• Vπ(s2) = ?? (depends on Vπ(s0))
• Vπ(s0) = ?? (depends on Vπ(s1), Vπ(s2))
General SSPs can be cyclic!
[Diagram: from s0, action a0 leads to s1 (Pr=0.6, C=5) or s2 (Pr=0.4, C=2); from s1, a1 leads to sg (C=1); from s2, a2 leads to sg (Pr=0.7, C=4) or back to s0 (Pr=0.3, C=3)]
• a simple system of linear equations:
  – Vπ(sg) = 0
  – Vπ(s1) = 1 + Vπ(sg) = 1
  – Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))
  – Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))
Policy Evaluation (Approach 1)
• Solve the linear system directly: |S| variables, O(|S|³) running time
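A sketch of Approach 1 for the cyclic example above, solving the |S|-variable linear system with NumPy (state ordering and variable names are mine):

```python
import numpy as np

# Policy evaluation (Approach 1) for the cyclic example above: solve the
# linear system directly. Unknowns: V(s0), V(s1), V(s2); V(sg) = 0.
# Row i encodes V(s_i) - sum_j P[i, j] * V(s_j) = expected immediate cost.
A = np.array([
    [1.0, -0.6, -0.4],   # V(s0) = 0.6*(5 + V(s1)) + 0.4*(2 + V(s2))
    [0.0,  1.0,  0.0],   # V(s1) = 1 + V(sg)
    [-0.3, 0.0,  1.0],   # V(s2) = 0.7*(4 + V(sg)) + 0.3*(3 + V(s0))
])
b = np.array([0.6 * 5 + 0.4 * 2,   # 3.8
              1.0,
              0.7 * 4 + 0.3 * 3])  # 3.7

V = np.linalg.solve(A, b)
print(V)   # approx [6.6818, 1.0, 5.7045]
```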
Iterative Policy Evaluation
[Diagram: the same cyclic MDP as above]
• Vπ(s0) = 4.4 + 0.4·Vπ(s2): 0, 5.88, 6.5856, 6.670272, 6.68043…
• Vπ(s2) = 3.7 + 0.3·Vπ(s0): 3.7, 5.464, 5.67568, 5.7010816, 5.704129…
Policy Evaluation (Approach 2)

\[
V^{\pi}(s) \;=\; \sum_{s' \in S} T(s, \pi(s), s') \,\big[\, C(s, \pi(s), s') + V^{\pi}(s') \,\big]
\]

• iterative refinement

\[
V^{\pi}_{n}(s) \;\leftarrow\; \sum_{s' \in S} T(s, \pi(s), s') \,\big[\, C(s, \pi(s), s') + V^{\pi}_{n-1}(s') \,\big]
\]
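A sketch of this iterative refinement for the same cyclic example; this version updates all states from the previous iterate, so the intermediate numbers differ slightly from the in-place sequence shown above, but the fixed point is the same:

```python
# Iterative policy evaluation: repeat V_n(s) <- sum_s' T(s,pi(s),s') [C + V_{n-1}(s')]
# for the cyclic example, until the largest change is below epsilon.
def evaluate_policy(eps=1e-6):
    v0, v1, v2, vg = 0.0, 0.0, 0.0, 0.0   # initial estimates
    while True:
        n0 = 0.6 * (5 + v1) + 0.4 * (2 + v2)
        n1 = 1.0 + vg
        n2 = 0.7 * (4 + vg) + 0.3 * (3 + v0)
        delta = max(abs(n0 - v0), abs(n1 - v1), abs(n2 - v2))
        v0, v1, v2 = n0, n1, n2
        if delta < eps:                   # epsilon-consistency: termination condition
            return v0, v1, v2

print(evaluate_policy())   # approx (6.6818, 1.0, 5.7045)
```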
Iterative Policy Evaluation
• iteration n
• ε-consistency: termination condition
Convergence & Optimality
• For a proper policy π, iterative policy evaluation converges to Vπ.
Policy Evaluation → Value Iteration
(Bellman Equations for MDP1)

\[
Q^{*}(s, a) \;=\; \sum_{s' \in S} T(s, a, s') \,\big[\, C(s, a, s') + V^{*}(s') \,\big]
\]

\[
V^{*}(s) \;=\; \min_{a} Q^{*}(s, a)
\]
Bellman Equations for MDP2
Fixed Point Computation in VI

\[
V^{*}(s) \;=\; \min_{a \in A} \sum_{s' \in S} T(s, a, s') \,\big[\, C(s, a, s') + V^{*}(s') \,\big]
\]

• iterative refinement

\[
V_{n}(s) \;\leftarrow\; \min_{a \in A} \sum_{s' \in S} T(s, a, s') \,\big[\, C(s, a, s') + V_{n-1}(s') \,\big]
\]

• non-linear
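A generic sketch of this fixed-point computation for a cost-based MDP; the dictionary representation and the extra "direct" action on s0 (added only so the min over actions matters) are assumptions, not from the slides:

```python
# Value iteration for a cost-based MDP (stochastic shortest path form):
# V_n(s) <- min_a sum_s' T(s,a,s') * (C(s,a,s') + V_{n-1}(s')).
# trans[s][a] is a list of (next_state, prob, cost); goal states have V = 0.
def value_iteration(trans, goals, eps=1e-6):
    V = {s: 0.0 for s in trans}
    V.update({g: 0.0 for g in goals})
    while True:
        delta = 0.0
        newV = dict(V)
        for s in trans:
            newV[s] = min(
                sum(p * (c + V[s2]) for s2, p, c in outcomes)
                for outcomes in trans[s].values()
            )
            delta = max(delta, abs(newV[s] - V[s]))
        V = newV
        if delta < eps:
            return V

# The cyclic example from the earlier slides, plus a hypothetical extra action.
trans = {
    "s0": {"a0": [("s1", 0.6, 5), ("s2", 0.4, 2)], "direct": [("sg", 1.0, 8)]},
    "s1": {"a1": [("sg", 1.0, 1)]},
    "s2": {"a2": [("sg", 0.7, 4), ("s0", 0.3, 3)]},
}
print(value_iteration(trans, goals={"sg"}))   # s0 ~ 6.68, s1 = 1, s2 ~ 5.70
```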
Example
[Diagram: MDP with states s0–s4 and goal sg; actions a00, a01 from s0, a1 from s1, a20, a21 from s2, a3 from s3, and a40 (C=5, straight to sg) and a41 (C=2; Pr=0.6 to sg, Pr=0.4 to s3) from s4]
Bellman Backup
• Q1(s4, a40) = 5 + 0 = 5
• Q1(s4, a41) = 2 + 0.6×0 + 0.4×2 = 2.8
  (using V0(sg) = 0, V0(s3) = 2)
• V1(s4) = min(5, 2.8) = 2.8
• agreedy = a41
Value Iteration [Bellman ’57]
• iteration n
• ε-consistency: termination condition
Example
(all actions cost 1 unless otherwise stated)
[Diagram: the same example MDP as above]
n Vn(s0) Vn(s1) Vn(s2) Vn(s3) Vn(s4)
0 3 3 2 2 1
1 3 3 2 2 2.8
2 3 3 3.8 3.8 2.8
3 4 4.8 3.8 3.8 3.52
4 4.8 4.8 4.52 4.52 3.52
5 5.52 5.52 4.52 4.52 3.808
20 5.99921 5.99921 4.99969 4.99969 3.99969
Comments
• Decision-theoretic Algorithm
• Dynamic Programming
• Fixed Point Computation
• Probabilistic version of Bellman-Ford
Algorithm
• for shortest path computation
• MDP1 : Stochastic Shortest Path Problem
Time Complexity
• one iteration: O(|S|²|A|)
• number of iterations: poly(|S|, |A|, 1/ε, 1/(1−γ))
Space Complexity: O(|S|)
Changing the Search Space
• Value Iteration
  – Search in value space
  – Compute the resulting policy
• Policy Iteration
  – Search in policy space
  – Compute the resulting value
Policy iteration [Howard ’60]
• repeat
  – Policy Evaluation: compute Vn+1, the approximate evaluation of πn
  – Policy Improvement: for all states s, compute πn+1(s) = argmin_{a∈Ap(s)} Qn+1(s,a)
• until πn+1 = πn
Advantage
• probably the most competitive synchronous dynamic programming algorithm
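A sketch of policy iteration over the same cost-based representation used in the value-iteration sketch above (helper names are mine; it assumes every policy considered is proper, i.e., reaches the goal):

```python
# Policy iteration: alternate policy evaluation and greedy policy improvement.
# Uses the same trans[s][a] = [(next_state, prob, cost), ...] format as above.
def q_value(trans, V, s, a):
    return sum(p * (c + V[s2]) for s2, p, c in trans[s][a])

def evaluate(trans, goals, policy, eps=1e-8):
    # Iterative evaluation of a fixed (assumed proper) policy.
    V = {s: 0.0 for s in trans}
    V.update({g: 0.0 for g in goals})
    while True:
        delta = 0.0
        for s in trans:
            new = q_value(trans, V, s, policy[s])
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < eps:
            return V

def policy_iteration(trans, goals):
    policy = {s: next(iter(trans[s])) for s in trans}   # arbitrary initial policy
    while True:
        V = evaluate(trans, goals, policy)              # policy evaluation
        new_policy = {s: min(trans[s], key=lambda a: q_value(trans, V, s, a))
                      for s in trans}                    # policy improvement
        if new_policy == policy:                        # until pi_{n+1} = pi_n
            return policy, V
        policy = new_policy

trans = {
    "s0": {"a0": [("s1", 0.6, 5), ("s2", 0.4, 2)], "direct": [("sg", 1.0, 8)]},
    "s1": {"a1": [("sg", 1.0, 1)]},
    "s2": {"a2": [("sg", 0.7, 4), ("s0", 0.3, 3)]},
}
print(policy_iteration(trans, goals={"sg"}))
```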
VI → Asynchronous VI
Hierarchical MDPs
• hierarchy of sub-tasks, actions to scale
better
Reinforcement Learning
• learning the transition probabilities and rewards
• acting while learning – connections to
psychology
Partially Observable Markov Decision
Processes
• noisy sensors; partially observable
environment
• popular in robotics