
COMP 2211 Exploring Artificial Intelligence

Introduction to Reinforcement Learning


Dr. Desmond Tsoi, Dr. Cecia Chan
Department of Computer Science & Engineering
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Reinforcement Learning
The goal of Reinforcement Learning (RL) is to design an autonomous/intelligent agent
that learns by interacting with an environment.
In the standard RL setting, the agent perceives a state at every time step and chooses an action.
The action is applied to the environment and the environment returns a reward and a new
state. The agent trains a policy to choose actions to maximize the sum of rewards.
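This perceive-act-learn loop can be sketched in a few lines of Python. The env and agent objects and their methods (reset, step, act, update) are hypothetical interfaces used only to illustrate the loop, not any particular library's API.

# A minimal sketch of the standard RL interaction loop (hypothetical interfaces).
def run_episode(env, agent, max_steps=100):
    state = env.reset()                              # agent perceives the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                    # the policy chooses an action
        next_state, reward, done = env.step(action)  # environment returns a reward and a new state
        agent.update(state, action, reward, next_state)  # the agent learns from the feedback
        total_reward += reward                       # the quantity the agent tries to maximize
        state = next_state
        if done:
            break
    return total_reward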

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 2 / 51


Definition of Reinforcement Learning

RL, a type of machine learning, in which agents take actions in an environment aimed at
maximizing their cumulative rewards - NVIDIA
RL is based on rewarding desired behaviors or punishing undesired ones. Instead of one
input producing one output, the algorithm produces a variety of outputs and is trained to
select the right one based on certain variables - Gartner

The definitions above come from experts in the field; however, for someone who is just starting
with RL, they may feel a little difficult.

Definition
Through a series of trial-and-error interactions, an agent keeps learning continuously in an interactive
environment from its own actions and experiences. Its only goal is to find a suitable action model
that maximizes the agent's total cumulative reward. It learns via interaction and feedback.

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 3 / 51


Explanation of Reinforcement Learning - Daily Life Example
Imagine training your dog to complete a task within an environment.
First, the trainer issues a command, which the dog observes (observation); the dog then
responds by taking an action.
If the action is close to the desired behavior, the trainer will likely provide a reward, such
as a food treat or a toy. Otherwise, no reward or a negative reward will be provided.
At the beginning of training, the dog will likely take more random actions like rolling over
when the command given is “sit”, as it is trying to associate specific observations with
actions and rewards.
This association, or mapping, between observations and actions is called policy.
From the dog’s perspective, the ideal case would be one in which it responds
correctly to every command, so that it gets as many treats as possible.
So, the whole meaning of reinforcement learning training is to “tune” the dog’s policy so
that it learns the desired behaviors that will maximize some reward.
After training is complete, the dog should be able to observe the owner and take the
appropriate action, for example, sitting when commanded to “sit” by using the internal
policy it has developed.
{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 4 / 51
Explanation of Reinforcement Learning - Daily Life Example
Question
List the following in reference to the dog training example.

Agent, Environment, Observations, Actions, Rewards, Policy

Answer
Agent: Your dog
Environment: Your home, backyard, or any other place where you teach and play with your dog
Observations: What the dog observes
Actions: Sit, Roll, Stand, Walk, etc.
Rewards: Food treat or a toy
Policy: Generate the correct actions from the observations

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 5 / 51


Idea of Reinforcement Learning - Another Example
Question
Consider the task of parking a vehicle using an automated driving system. List the following in reference to
this problem.

Agent, Environment, Observations, Actions, Rewards, Policy

Answer
Agent: Vehicle computer
Environment: Parking area
Observations: Readings from sensors such as cameras, GPS, and lidar (light detection and ranging)
Actions: Generate steering, braking, and acceleration commands
Rewards: Reach the parking point as soon as possible
Policy: Generate the correct actions from the observations

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 6 / 51


Part I

Basic Concepts

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 7 / 51


Example: Academic Life
An assistant professor gets paid, say, 160K per year.
How much, in total, will the assistant professor earn in their life?
160 + 160 + 160 + 160 + 160 + . . . = Infinity

What’s wrong with this argument?

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 8 / 51


Discounted Rewards & Discounted Sum of Future Rewards
A reward (payment) in the future is not worth quite as much as a reward now, because of
inflation. For example, being promised $10,000 next year is worth only 90% as much as
receiving $10,000 right now.
Question
Assume a payment n years in the future is worth only (0.9)^n of a payment now. What is the assistant
professor's future discounted sum of rewards?

160 + 160 × (0.9)^1 + 160 × (0.9)^2 + 160 × (0.9)^3 + ...
= 160 × (1 + 0.9 + (0.9)^2 + (0.9)^3 + ...)
= 160 × 1/(1 − 0.9)
= 1600
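As a quick numerical check of this geometric sum, here is a small Python sketch (a truncated sum standing in for the infinite one):

# Discounted sum of a constant 160K salary with discount factor 0.9.
gamma = 0.9
total = sum(160 * gamma**n for n in range(1000))  # truncate the infinite series
print(round(total, 4))  # approaches 160 / (1 - 0.9) = 1600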

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 9 / 51


Discounted Rewards & Discounted Sum of Future Rewards

Discounting uses a parameter called the discount factor, γ, with 0 ≤ γ ≤ 1; a reward received k time
steps in the future is multiplied by γ^k.
The "discounted sum of future rewards" using discount factor γ is

(reward now) + γ(reward in 1 time step from now)
             + γ^2(reward in 2 time steps from now)
             + γ^3(reward in 3 time steps from now)
             + ...

People in economics and probabilistic decision-making do this all the time.

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 10 / 51


Example: The Academic Life
[State-transition diagram: Assistant Prof. (A, reward 160), Associate Prof. (B, reward 480), Full Prof. (F, reward 3200), On the Street (S, reward 80), Dead (D, reward 0). From A: stay with 0.6, to B with 0.2, to S with 0.2. From B: stay with 0.6, to F with 0.2, to S with 0.2. From F: stay with 0.7, to D with 0.3. From S: stay with 0.7, to D with 0.3.]

Hint: Expected value computation
Suppose we roll a fair 6-sided die. The expected value of our die roll is
(1/6) × 1 + (1/6) × 2 + (1/6) × 3 + (1/6) × 4 + (1/6) × 5 + (1/6) × 6 = 3.5

Define:
V_A = Expected discounted sum of future rewards starting in state A
V_B = Expected discounted sum of future rewards starting in state B
V_F = Expected discounted sum of future rewards starting in state F
V_S = Expected discounted sum of future rewards starting in state S
V_D = Expected discounted sum of future rewards starting in state D

Question
Assume discount factor γ = 0.9. How do we compute V_A, V_B, V_F, V_S, V_D?

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 11 / 51


Example: The Academic Life - Start from state A
But there are so many different possibilities!!! Each with different probability :(

Sample episodes, all start from A:


A → B → F → D:
160 + (0.2)(0.9)^1(480) + (0.2)(0.2)(0.9)^2(3200) + (0.2)(0.2)(0.3)(0.9)^3(0) = 350.08
A → A → S → D:
160 + (0.6)(0.9)^1(160) + (0.6)(0.2)(0.9)^2(80) + (0.6)(0.2)(0.3)(0.9)^3(0) = 254.176
A → B → S → D:
160 + (0.2)(0.9)^1(480) + (0.2)(0.2)(0.9)^2(80) + (0.2)(0.2)(0.3)(0.9)^3(0) = 248.992
···

It is very difficult to compute V_A this way. >.<


{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 12 / 51
Idea
[State-transition diagram for the Academic Life example, as on the previous slides.]

Let V_A^1, V_B^1, V_F^1, V_S^1, V_D^1 be the expected discounted sums of rewards over the next 1 time
step from now. Do you know how to find them?
Let V_A^2 be the expected discounted sum of rewards over the next 2 time steps from now.
Do you know how to find it if you know V_A^1, V_B^1, V_F^1, V_S^1, and V_D^1?
{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 13 / 51
[State-transition diagram for the Academic Life example, as above.]

V_A^1 = 160, V_B^1 = 480, V_F^1 = 3200, V_S^1 = 80, V_D^1 = 0

V_A^2 = 160 + 0.9 × (P_AA V_A^1 + P_AB V_B^1 + P_AF V_F^1 + P_AS V_S^1 + P_AD V_D^1)
      = 160 + 0.9 × (0.6(160) + 0.2(480) + 0(3200) + 0.2(80) + 0(0)) = 347.2

Do you know how to compute V_B^2, V_F^2, V_S^2, V_D^2, ...?
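The same one-step backup can be written with NumPy. The matrix and rewards below are taken from the Academic Life example (state order A, B, F, S, D); the variable names are just for illustration:

import numpy as np

# Academic Life example, states in the order A, B, F, S, D.
T = np.array([[0.6, 0.2, 0.0, 0.2, 0.0],
              [0.0, 0.6, 0.2, 0.2, 0.0],
              [0.0, 0.0, 0.7, 0.0, 0.3],
              [0.0, 0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.0, 0.0, 0.0]])
R = np.array([160.0, 480.0, 3200.0, 80.0, 0.0])
gamma = 0.9

V1 = R.copy()            # V^1(s_i) = r(s_i)
V2 = R + gamma * T @ V1  # one-step backup gives all the V^2 values at once
print(V2[0])             # V^2(A) = 347.2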


{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 14 / 51
Markov Property

Definition
A state S_t is Markov if and only if

P(S_{t+1} | S_t) = P(S_{t+1} | S_1, ..., S_t)

The future is independent of the past given the present


The present state captures all relevant information from the history
Once the present state is known, the history may be thrown away

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 15 / 51


Part II

Problem Formulation

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 16 / 51


A Markov System with Rewards
A Markov system consists of the following:
A set of N states {s_1, s_2, ..., s_N}
A transition probability matrix T (i.e., a 2D array showing the probability of going from one state to another state), with rows indexed by the current state ("from") and columns by the next state ("to"):

      [ T_11  T_12  ···  T_1N ]
  T = [ T_21  T_22  ···  T_2N ]
      [  ···   ···  ···   ··· ]
      [ T_N1  T_N2  ···  T_NN ]

where T_ij = P(next state s_{t+1} = s_j | this state s_t = s_i)
Note: Each row of the matrix sums to 1
Each state has a reward {r_1, r_2, ..., r_N}
There is a discount factor γ, where 0 < γ < 1
All future rewards are discounted by γ
{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 17 / 51
Example: The Academic Life

[State-transition diagram for the Academic Life example, as in Part I.]

What are the states, transition probability matrix, rewards, discount factor for this problem?

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 18 / 51


Example: Academic Life

States: {A, B, F, S, D}

Transition probability matrix (rows and columns in the order A, B, F, S, D):

      [ 0.6  0.2   0   0.2   0  ]
      [  0   0.6  0.2  0.2   0  ]
  T = [  0    0   0.7   0   0.3 ]
      [  0    0    0   0.7  0.3 ]
      [  0    0    0    0    0  ]

Rewards: {160, 480, 3200, 80, 0}
Discount factor: 0.9

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 19 / 51


Value Function

Solving a Markov system


Write V(s_i) = expected discounted sum of future rewards starting in state s_i

V(s_i) = r(s_i) + γ · (expected future rewards starting from the next state)
       = r(s_i) + γ · (T_i1 V(s_1) + T_i2 V(s_2) + ... + T_iN V(s_N))

Using vector notation, we have

  [ V(s_1) ]   [ r_1 ]       [ T_11  T_12  ···  T_1N ] [ V(s_1) ]
  [ V(s_2) ] = [ r_2 ]  + γ  [ T_21  T_22  ···  T_2N ] [ V(s_2) ]
  [   ···  ]   [ ··· ]       [  ···   ···  ···   ··· ] [   ···  ]
  [ V(s_N) ]   [ r_N ]       [ T_N1  T_N2  ···  T_NN ] [ V(s_N) ]

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 20 / 51


Solving the System of Linear Equations

The equation is linear in V, so it can be solved directly:

V = R + γTV
(I − γT)V = R
V = (I − γT)^(−1) R

The good thing about solving the equation directly is that you get an exact answer.
The bad thing is that it is slow if you have a large number of states, i.e., when N is big.
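Before turning to iterative methods, here is a minimal NumPy sketch of the direct solution V = (I − γT)^(−1) R, reusing the Academic Life matrices from earlier (an illustration, not part of the original slides):

import numpy as np

T = np.array([[0.6, 0.2, 0.0, 0.2, 0.0],
              [0.0, 0.6, 0.2, 0.2, 0.0],
              [0.0, 0.0, 0.7, 0.0, 0.3],
              [0.0, 0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.0, 0.0, 0.0]])
R = np.array([160.0, 480.0, 3200.0, 80.0, 0.0])
gamma = 0.9

# Solve (I - gamma * T) V = R rather than forming the inverse explicitly.
V = np.linalg.solve(np.eye(len(R)) - gamma * T, R)
print(V)  # exact values of V_A, V_B, V_F, V_S, V_D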
There are many iterative methods for solving the equation, e.g.,
Dynamic programming (We will do Value Iteration)
Monte-Carlo evaluation
Temporal-Difference learning

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 21 / 51


Value Iteration

Define
V^1(s_i) = Expected discounted sum of rewards over the next 1 time step from now
V^2(s_i) = Expected discounted sum of rewards over the next 2 time steps from now
V^3(s_i) = Expected discounted sum of rewards over the next 3 time steps from now
···
V^k(s_i) = Expected discounted sum of rewards over the next k time steps from now
What are the formulas to compute all of them?
V^1(s_i) = r(s_i)
V^2(s_i) = r(s_i) + γ(T_i1 V^1(s_1) + T_i2 V^1(s_2) + ... + T_iN V^1(s_N))
V^3(s_i) = r(s_i) + γ(T_i1 V^2(s_1) + T_i2 V^2(s_2) + ... + T_iN V^2(s_N))
···
V^k(s_i) = r(s_i) + γ(T_i1 V^{k−1}(s_1) + T_i2 V^{k−1}(s_2) + ... + T_iN V^{k−1}(s_N))

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 22 / 51


Example: Weather

 
[State-transition diagram for the Weather example: Sun (S, reward 4), Wind (W, reward 0), Hail (H, reward −8); the transition probabilities are given by T below. The calculations on the following slides use discount factor γ = 0.5.]

States in the order S, W, H:

      [ 0.5  0.5   0  ]
  T = [ 0.5   0   0.5 ]
      [  0   0.5  0.5 ]

k | V^k(S) | V^k(W) | V^k(H)
1 |        |        |
2 |        |        |
3 |        |        |
4 |        |        |
5 |        |        |

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 23 / 51


Example: Weather
[Weather diagram and transition matrix as on the previous slide.]

k | V^k(S) | V^k(W) | V^k(H)
1 |   4    |   0    |  −8

V^1(S) = r(S) = 4
V^1(W) = r(W) = 0
V^1(H) = r(H) = −8
V^2(S) = r(S) + γ(T_SS V^1(S) + T_SW V^1(W) + T_SH V^1(H))
V^2(W) = r(W) + γ(T_WS V^1(S) + T_WW V^1(W) + T_WH V^1(H))
V^2(H) = r(H) + γ(T_HS V^1(S) + T_HW V^1(W) + T_HH V^1(H))
{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 24 / 51
Example: Weather
[Weather diagram and transition matrix as above.]

k | V^k(S) | V^k(W) | V^k(H)
1 |   4    |   0    |  −8
2 |   5    |  −1    | −10

V^2(S) = r(S) + γ(T_SS V^1(S) + T_SW V^1(W) + T_SH V^1(H))
       = 4 + 0.5(0.5 × 4 + 0.5 × 0 + 0 × (−8)) = 5
V^2(W) = r(W) + γ(T_WS V^1(S) + T_WW V^1(W) + T_WH V^1(H))
       = 0 + 0.5(0.5 × 4 + 0 × 0 + 0.5 × (−8)) = −1
V^2(H) = r(H) + γ(T_HS V^1(S) + T_HW V^1(W) + T_HH V^1(H))
       = −8 + 0.5(0 × 4 + 0.5 × 0 + 0.5 × (−8)) = −10


{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 25 / 51
Example: Weather
[Weather diagram and transition matrix as above.]

k | V^k(S) | V^k(W) | V^k(H)
1 |   4    |   0    |  −8
2 |   5    |  −1    | −10
3 |   5    | −1.25  | −10.75

V^3(S) = r(S) + γ(T_SS V^2(S) + T_SW V^2(W) + T_SH V^2(H))
       = 4 + 0.5(0.5 × 5 + 0.5 × (−1) + 0 × (−10)) = 5
V^3(W) = r(W) + γ(T_WS V^2(S) + T_WW V^2(W) + T_WH V^2(H))
       = 0 + 0.5(0.5 × 5 + 0 × (−1) + 0.5 × (−10)) = −1.25
V^3(H) = r(H) + γ(T_HS V^2(S) + T_HW V^2(W) + T_HH V^2(H))
       = −8 + 0.5(0 × 5 + 0.5 × (−1) + 0.5 × (−10)) = −10.75


{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 26 / 51
Example: Weather
[Weather diagram and transition matrix as above.]

k | V^k(S) | V^k(W)  | V^k(H)
1 |   4    |   0     |  −8
2 |   5    |  −1     | −10
3 |   5    | −1.25   | −10.75
4 | 4.9375 | −1.4375 | −11

V^4(S) = r(S) + γ(T_SS V^3(S) + T_SW V^3(W) + T_SH V^3(H))
       = 4 + 0.5(0.5 × 5 + 0.5 × (−1.25) + 0 × (−10.75)) = 4.9375
V^4(W) = r(W) + γ(T_WS V^3(S) + T_WW V^3(W) + T_WH V^3(H))
       = 0 + 0.5(0.5 × 5 + 0 × (−1.25) + 0.5 × (−10.75)) = −1.4375
V^4(H) = r(H) + γ(T_HS V^3(S) + T_HW V^3(W) + T_HH V^3(H))
       = −8 + 0.5(0 × 5 + 0.5 × (−1.25) + 0.5 × (−10.75)) = −11


{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 27 / 51
Example: Weather
[Weather diagram and transition matrix as above.]

k | V^k(S) | V^k(W)    | V^k(H)
1 |   4    |   0       |  −8
2 |   5    |  −1       | −10
3 |   5    | −1.25     | −10.75
4 | 4.9375 | −1.4375   | −11
5 | 4.875  | −1.515625 | −11.109375

V^5(S) = r(S) + γ(T_SS V^4(S) + T_SW V^4(W) + T_SH V^4(H))
       = 4 + 0.5(0.5 × 4.9375 + 0.5 × (−1.4375) + 0 × (−11)) = 4.875
V^5(W) = r(W) + γ(T_WS V^4(S) + T_WW V^4(W) + T_WH V^4(H))
       = 0 + 0.5(0.5 × 4.9375 + 0 × (−1.4375) + 0.5 × (−11)) = −1.515625
V^5(H) = r(H) + γ(T_HS V^4(S) + T_HW V^4(W) + T_HH V^4(H))
       = −8 + 0.5(0 × 4.9375 + 0.5 × (−1.4375) + 0.5 × (−11)) = −11.109375


{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 28 / 51
Value Iteration for Solving the System of Linear Equations

Compute V^1(s_i) for each i in range [1, N]
Compute V^2(s_i) for each i in range [1, N]
Compute V^3(s_i) for each i in range [1, N]
···
Compute V^k(s_i) for each i in range [1, N]

When to stop?
When the maximum absolute difference between two successive expected discounted sums of
rewards (V^k and V^{k−1}) is less than a threshold ξ, i.e.,

max_i |V^k(s_i) − V^{k−1}(s_i)| < ξ
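A minimal Python sketch of this loop, plugging in the Weather example's matrix, rewards, and γ = 0.5 (the threshold value ξ used here is an arbitrary choice for illustration):

import numpy as np

# Weather example, states in the order S (Sun), W (Wind), H (Hail).
T = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
R = np.array([4.0, 0.0, -8.0])
gamma, xi = 0.5, 1e-6

V = R.copy()                            # V^1(s_i) = r(s_i)
while True:
    V_new = R + gamma * T @ V           # V^k = R + gamma * T * V^{k-1}
    if np.max(np.abs(V_new - V)) < xi:  # stop when successive iterates are within xi
        break
    V = V_new
print(V_new)  # converged approximation of V(S), V(W), V(H); the early iterates match the table above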

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 29 / 51


Markov Decision Process
A Markov Decision Process (MDP) is a Markov reward process with decisions.
It is an environment in which all states are Markov.

Definition
A Markov Decision Process is a tuple ⟨S, A, T, R, γ⟩
S: A finite set of states {s_1, s_2, ..., s_N}
A: A finite set of actions {a_1, a_2, ..., a_M}
T: A transition probability matrix for each action a, with T^a_ij = P(s_j | s_i, A = a)
R: Each state has a reward {r_1, r_2, ..., r_N}
γ: A discount factor, 0 ≤ γ ≤ 1

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 30 / 51


Value Iteration (How to Determine Actions?)
Define
V^1(s_i) = Expected discounted sum of rewards over the next 1 time step from now
V^2(s_i) = Expected discounted sum of rewards over the next 2 time steps from now
V^3(s_i) = Expected discounted sum of rewards over the next 3 time steps from now
···
V^k(s_i) = Expected discounted sum of rewards over the next k time steps from now
What are the formulas to compute all of them?
V^1(s_i) = r(s_i)
V^2(s_i) = max_a (r(s_i) + γ(T^a_i1 V^1(s_1) + T^a_i2 V^1(s_2) + ... + T^a_iN V^1(s_N)))
V^3(s_i) = max_a (r(s_i) + γ(T^a_i1 V^2(s_1) + T^a_i2 V^2(s_2) + ... + T^a_iN V^2(s_N)))
···
V^k(s_i) = max_a (r(s_i) + γ(T^a_i1 V^{k−1}(s_1) + T^a_i2 V^{k−1}(s_2) + ... + T^a_iN V^{k−1}(s_N)))

Bellman Optimality Equation

V^k(s_i) = max_a (r(s_i) + γ(T^a_i1 V^{k−1}(s_1) + T^a_i2 V^{k−1}(s_2) + ... + T^a_iN V^{k−1}(s_N)))
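One backup of this max-over-actions update can be sketched in NumPy. Storing the transition probabilities as a 3-D array T[a, i, j] is an assumption made for this illustration:

import numpy as np

def bellman_backup(V_prev, R, T, gamma):
    # One value-iteration step: V^k(s_i) = max_a ( r(s_i) + gamma * sum_j T[a, i, j] * V^{k-1}(s_j) ).
    # T has shape (num_actions, N, N), so T @ V_prev has shape (num_actions, N).
    q = R + gamma * (T @ V_prev)  # q[a, i] = value of taking action a in state s_i
    return q.max(axis=0)          # maximize over actions for each state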

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 31 / 51


Policies
A policy is a mapping from states to actions.

[State-transition diagram: states PU (Poor & Unknown, reward 0), PF (Poor & Famous, reward 0), RU (Rich & Unknown, reward 10), RF (Rich & Famous, reward 10); from each state the agent chooses action A or S, with transition probabilities given by the matrices T^A and T^S on the following slides.]

Policy 1
State | Action
PU    | S
PF    | A
RU    | S
RF    | A

Policy 2
State | Action
PU    | A
PF    | A
RU    | A
RF    | A

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 32 / 51


Finding the Near Optimal Policy

Compute V^k(s_i) for all i using value iteration, and compute the (near) optimal policy π(s_i) in state s_i as

π(s_i) = argmax_a (r(s_i) + γ(T^a_i1 V^{k−1}(s_1) + T^a_i2 V^{k−1}(s_2) + ... + T^a_iN V^{k−1}(s_N)))

until

max_i |V^{k+1}(s_i) − V^k(s_i)| < ξ

Once this is done, the near optimal policy consists of taking, in each state, the action that achieves the
maximum expected discounted value above; a small sketch follows below.
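The greedy policy can be read off with an argmax over actions. This sketch reuses the same assumed T[a, i, j] layout as the bellman_backup sketch above:

import numpy as np

def greedy_policy(V, R, T, gamma):
    # pi(s_i) = argmax_a ( r(s_i) + gamma * sum_j T[a, i, j] * V(s_j) )
    q = R + gamma * (T @ V)  # q[a, i] = value of taking action a in state s_i
    return q.argmax(axis=0)  # index of the best action for each state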

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 33 / 51


Example
γ = 0.9

[State-transition diagram for the two actions A and S over the states PU, PF, RU, RF, as summarized by the matrices below.]

Rewards R (for PU, PF, RU, RF): 0, 0, 10, 10

T^A:        PU   PF   RU   RF
      PU   0.5  0.5    0    0
      PF     0    1    0    0
      RU   0.5  0.5    0    0
      RF     0    1    0    0

T^S:        PU   PF   RU   RF
      PU     1    0    0    0
      PF   0.5    0    0  0.5
      RU   0.5    0  0.5    0
      RF     0    0  0.5  0.5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 34 / 51


Example
γ = 0.9; R, T^A, T^S as defined on the previous slide.

The value table V^k(PU), V^k(PF), V^k(RU), V^k(RF) and the policy table π^k(PU), π^k(PF), π^k(RU), π^k(RF), for k = 1, ..., 6, are filled in over the following slides.

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 35 / 51


Example

γ = 0.9; R, T^A, T^S as defined earlier.

k | V(PU) | V(PF) | V(RU) | V(RF)
1 |   0   |   0   |  10   |  10

V^1(PU) = 0
V^1(PF) = 0
V^1(RU) = 10
V^1(RF) = 10

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 36 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^2(PU) = 0, π^2(PU) = A/S (both actions tie).

V^2(PU) = max( r(PU) + γ(T^A_PU→PU V^1(PU) + T^A_PU→PF V^1(PF) + T^A_PU→RU V^1(RU) + T^A_PU→RF V^1(RF)),
               r(PU) + γ(T^S_PU→PU V^1(PU) + T^S_PU→PF V^1(PF) + T^S_PU→RU V^1(RU) + T^S_PU→RF V^1(RF)) )
        = max( 0 + 0.9 × (0.5 × 0 + 0.5 × 0 + 0 × 10 + 0 × 10),
               0 + 0.9 × (1 × 0 + 0 × 0 + 0 × 10 + 0 × 10) )
        = max(0, 0) = 0

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 37 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^2(PF) = 4.5, π^2(PF) = S.

V^2(PF) = max( r(PF) + γ(T^A_PF→PU V^1(PU) + T^A_PF→PF V^1(PF) + T^A_PF→RU V^1(RU) + T^A_PF→RF V^1(RF)),
               r(PF) + γ(T^S_PF→PU V^1(PU) + T^S_PF→PF V^1(PF) + T^S_PF→RU V^1(RU) + T^S_PF→RF V^1(RF)) )
        = max( 0 + 0.9 × (0 × 0 + 1 × 0 + 0 × 10 + 0 × 10),
               0 + 0.9 × (0.5 × 0 + 0 × 0 + 0 × 10 + 0.5 × 10) )
        = max(0, 4.5) = 4.5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 38 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^2(RU) = 14.5, π^2(RU) = S.

V^2(RU) = max( r(RU) + γ(T^A_RU→PU V^1(PU) + T^A_RU→PF V^1(PF) + T^A_RU→RU V^1(RU) + T^A_RU→RF V^1(RF)),
               r(RU) + γ(T^S_RU→PU V^1(PU) + T^S_RU→PF V^1(PF) + T^S_RU→RU V^1(RU) + T^S_RU→RF V^1(RF)) )
        = max( 10 + 0.9 × (0.5 × 0 + 0.5 × 0 + 0 × 10 + 0 × 10),
               10 + 0.9 × (0.5 × 0 + 0 × 0 + 0.5 × 10 + 0 × 10) )
        = max(10, 14.5) = 14.5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 39 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^2(RF) = 19, π^2(RF) = S.

V^2(RF) = max( r(RF) + γ(T^A_RF→PU V^1(PU) + T^A_RF→PF V^1(PF) + T^A_RF→RU V^1(RU) + T^A_RF→RF V^1(RF)),
               r(RF) + γ(T^S_RF→PU V^1(PU) + T^S_RF→PF V^1(PF) + T^S_RF→RU V^1(RU) + T^S_RF→RF V^1(RF)) )
        = max( 10 + 0.9 × (0 × 0 + 1 × 0 + 0 × 10 + 0 × 10),
               10 + 0.9 × (0 × 0 + 0 × 0 + 0.5 × 10 + 0.5 × 10) )
        = max(10, 19) = 19

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 40 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^3(PU) = 2.025, π^3(PU) = A.

V^3(PU) = max( r(PU) + γ(T^A_PU→PU V^2(PU) + T^A_PU→PF V^2(PF) + T^A_PU→RU V^2(RU) + T^A_PU→RF V^2(RF)),
               r(PU) + γ(T^S_PU→PU V^2(PU) + T^S_PU→PF V^2(PF) + T^S_PU→RU V^2(RU) + T^S_PU→RF V^2(RF)) )
        = max( 0 + 0.9 × (0.5 × 0 + 0.5 × 4.5 + 0 × 14.5 + 0 × 19),
               0 + 0.9 × (1 × 0 + 0 × 4.5 + 0 × 14.5 + 0 × 19) )
        = max(2.025, 0) = 2.025

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 41 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^3(PF) = 8.55, π^3(PF) = S.

V^3(PF) = max( r(PF) + γ(T^A_PF→PU V^2(PU) + T^A_PF→PF V^2(PF) + T^A_PF→RU V^2(RU) + T^A_PF→RF V^2(RF)),
               r(PF) + γ(T^S_PF→PU V^2(PU) + T^S_PF→PF V^2(PF) + T^S_PF→RU V^2(RU) + T^S_PF→RF V^2(RF)) )
        = max( 0 + 0.9 × (0 × 0 + 1 × 4.5 + 0 × 14.5 + 0 × 19),
               0 + 0.9 × (0.5 × 0 + 0 × 4.5 + 0 × 14.5 + 0.5 × 19) )
        = max(4.05, 8.55) = 8.55

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 42 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^3(RU) = 16.525, π^3(RU) = S.

V^3(RU) = max( r(RU) + γ(T^A_RU→PU V^2(PU) + T^A_RU→PF V^2(PF) + T^A_RU→RU V^2(RU) + T^A_RU→RF V^2(RF)),
               r(RU) + γ(T^S_RU→PU V^2(PU) + T^S_RU→PF V^2(PF) + T^S_RU→RU V^2(RU) + T^S_RU→RF V^2(RF)) )
        = max( 10 + 0.9 × (0.5 × 0 + 0.5 × 4.5 + 0 × 14.5 + 0 × 19),
               10 + 0.9 × (0.5 × 0 + 0 × 4.5 + 0.5 × 14.5 + 0 × 19) )
        = max(12.025, 16.525) = 16.525

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 43 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^3(RF) = 25.075, π^3(RF) = S.

V^3(RF) = max( r(RF) + γ(T^A_RF→PU V^2(PU) + T^A_RF→PF V^2(PF) + T^A_RF→RU V^2(RU) + T^A_RF→RF V^2(RF)),
               r(RF) + γ(T^S_RF→PU V^2(PU) + T^S_RF→PF V^2(PF) + T^S_RF→RU V^2(RU) + T^S_RF→RF V^2(RF)) )
        = max( 10 + 0.9 × (0 × 0 + 1 × 4.5 + 0 × 14.5 + 0 × 19),
               10 + 0.9 × (0 × 0 + 0 × 4.5 + 0.5 × 14.5 + 0.5 × 19) )
        = max(14.05, 25.075) = 25.075

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 44 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^4(PU) = 4.75875, π^4(PU) = A.

V^4(PU) = max( r(PU) + γ(T^A_PU→PU V^3(PU) + T^A_PU→PF V^3(PF) + T^A_PU→RU V^3(RU) + T^A_PU→RF V^3(RF)),
               r(PU) + γ(T^S_PU→PU V^3(PU) + T^S_PU→PF V^3(PF) + T^S_PU→RU V^3(RU) + T^S_PU→RF V^3(RF)) )
        = max( 0 + 0.9 × (0.5 × 2.025 + 0.5 × 8.55 + 0 × 16.525 + 0 × 25.075),
               0 + 0.9 × (1 × 2.025 + 0 × 8.55 + 0 × 16.525 + 0 × 25.075) )
        = max(4.75875, 1.8225) = 4.75875

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 45 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^4(PF) = 12.195, π^4(PF) = S.

V^4(PF) = max( r(PF) + γ(T^A_PF→PU V^3(PU) + T^A_PF→PF V^3(PF) + T^A_PF→RU V^3(RU) + T^A_PF→RF V^3(RF)),
               r(PF) + γ(T^S_PF→PU V^3(PU) + T^S_PF→PF V^3(PF) + T^S_PF→RU V^3(RU) + T^S_PF→RF V^3(RF)) )
        = max( 0 + 0.9 × (0 × 2.025 + 1 × 8.55 + 0 × 16.525 + 0 × 25.075),
               0 + 0.9 × (0.5 × 2.025 + 0 × 8.55 + 0 × 16.525 + 0.5 × 25.075) )
        = max(7.695, 12.195) = 12.195

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 46 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^4(RU) = 18.3475, π^4(RU) = S.

V^4(RU) = max( r(RU) + γ(T^A_RU→PU V^3(PU) + T^A_RU→PF V^3(PF) + T^A_RU→RU V^3(RU) + T^A_RU→RF V^3(RF)),
               r(RU) + γ(T^S_RU→PU V^3(PU) + T^S_RU→PF V^3(PF) + T^S_RU→RU V^3(RU) + T^S_RU→RF V^3(RF)) )
        = max( 10 + 0.9 × (0.5 × 2.025 + 0.5 × 8.55 + 0 × 16.525 + 0 × 25.075),
               10 + 0.9 × (0.5 × 2.025 + 0 × 8.55 + 0.5 × 16.525 + 0 × 25.075) )
        = max(14.75875, 18.3475) = 18.3475

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 47 / 51


Example
γ = 0.9; R, T^A, T^S as defined earlier.
Table update: V^4(RF) = 28.72, π^4(RF) = S.

V^4(RF) = max( r(RF) + γ(T^A_RF→PU V^3(PU) + T^A_RF→PF V^3(PF) + T^A_RF→RU V^3(RU) + T^A_RF→RF V^3(RF)),
               r(RF) + γ(T^S_RF→PU V^3(PU) + T^S_RF→PF V^3(PF) + T^S_RF→RU V^3(RU) + T^S_RF→RF V^3(RF)) )
        = max( 10 + 0.9 × (0 × 2.025 + 1 × 8.55 + 0 × 16.525 + 0 × 25.075),
               10 + 0.9 × (0 × 2.025 + 0 × 8.55 + 0.5 × 16.525 + 0.5 × 25.075) )
        = max(17.695, 28.72) = 28.72

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 48 / 51


Example

γ = 0.9; R, T^A, T^S as defined earlier.

k | V(PU)   | V(PF)   | V(RU)   | V(RF)   | π(PU) | π(PF) | π(RU) | π(RF)
1 | 0       | 0       | 10      | 10      |       |       |       |
2 | 0       | 4.5     | 14.5    | 19      | A/S   | S     | S     | S
3 | 2.025   | 8.55    | 16.525  | 25.075  | A     | S     | S     | S
4 | 4.75875 | 12.195  | 18.3475 | 28.72   | A     | S     | S     | S
5 | 7.62919 | 15.0654 | 20.3978 | 31.1804 | A     | S     | S     | S
6 | 10.2126 | 17.4643 | 22.6121 | 33.2102 | A     | S     | S     | S

Practice
Can you calculate the remaining ones? :)
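For checking this, a small NumPy sketch that reproduces rows k = 2 to 6 of the table above (the state order [PU, PF, RU, RF] and action order [A, S] are the layout assumed for this illustration):

import numpy as np

R = np.array([0.0, 0.0, 10.0, 10.0])   # rewards for PU, PF, RU, RF
T = np.array([                          # T[a, i, j], actions in the order [A, S]
    [[0.5, 0.5, 0.0, 0.0],              # action A
     [0.0, 1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0],              # action S
     [0.5, 0.0, 0.0, 0.5],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.5]],
])
gamma = 0.9

V = R.copy()                               # k = 1
for k in range(2, 7):                      # rows k = 2..6 of the table
    V = (R + gamma * (T @ V)).max(axis=0)  # same backup as bellman_backup above
    print(k, np.round(V, 5))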

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 49 / 51


Pros and Cons of Value Iteration

Pros:
Will converge towards optimal values
Good for a small set of states
Cons:
Value iteration has to touch every state in every iteration, so it suffers when the total
number of states is large
It is slow because it has to consider every action in every state, and often there are many
actions

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 50 / 51


That’s all!
Any questions?

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 51 / 51
