12 Reinforcement Learning Full
RL is a type of machine learning in which agents take actions in an environment aimed at
maximizing their cumulative rewards - NVIDIA
RL is based on rewarding desired behaviors or punishing undesired ones. Instead of one
input producing one output, the algorithm produces a variety of outputs and is trained to
select the right one based on certain variables - Gartner
The definitions above come from experts in the field; however, for someone who is just
starting with RL, they may feel a little difficult.
Definition
Through trial and error, an agent continuously learns in an interactive environment from its own
actions and experiences. Its only goal is to find a suitable action model (a policy) that maximizes
the agent's total cumulative reward. It learns via interaction and feedback.
Example: Training a dog
Agent: Your dog
Environment: Your home, backyard, or any other place where you teach and play with your dog
Observations: What the dog observes
Actions: Sit, Roll, Stand, Walk, etc.
Rewards: Food treat or a toy
Policy: Generate the correct actions from the observations
Example: Self-parking vehicle
Agent: Vehicle computer
Environment: Parking area
Observations: Readings from sensors such as cameras, GPS, and lidar (light detection and ranging)
Actions: Generate steering, braking, and acceleration commands
Rewards: Reach the parking point as soon as possible
Policy: Generate the correct actions from the observations
Basic Concepts
[Figure: The Academic Life Markov chain. States and their rewards: Assistant Prof. (A, 160), Associate Prof. (B, 480), Full Prof. (F, 3200), On the Street (S, 80), Dead (D, 0). The arrows between states carry transition probabilities such as 0.2, 0.3, 0.6, and 0.7.]
Let V^1(A), V^1(B), V^1(F), V^1(S), V^1(D) be the expected discounted sums of rewards over the next 1 time
step from now. Do you know how to find them?
Let V^2(A) be the expected discounted sum of rewards over the next 2 time steps from now.
Do you know how to find it if you know V^1(A), V^1(B), V^1(F), V^1(S), and V^1(D)?
Definition
A state S_t is Markov if and only if P(S_{t+1} | S_t) = P(S_{t+1} | S_1, S_2, . . . , S_t), i.e., the next state depends only on the current state and not on the earlier history.
Problem Formulation
The transition probabilities are collected in an N × N matrix T, whose rows correspond to the current ("from") state and whose columns to the next ("to") state:

      T_11   T_12   · · ·   T_1N
      T_21   T_22   · · ·   T_2N
T =    ..     ..     ..      ..
      T_N1   T_N2   · · ·   T_NN

where T_ij = P(next state s_{t+1} = s_j | this state s_t = s_i)
Note: Each row of the matrix sums to 1
Each state has a reward {r1 , r2 , . . . , rN }
There is a discount factor γ, where 0 < γ < 1
All future rewards are discounted by γ
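As a small illustration (the states and numbers below are made up for illustration, not taken from the slides), such a setup can be written down in a few lines of NumPy:

```python
import numpy as np

# Hypothetical 3-state Markov chain (illustrative values only).
# T[i, j] = P(next state = s_j | current state = s_i), so each row sums to 1.
T = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.0, 0.5, 0.5],
])

R = np.array([1.0, 0.0, -1.0])  # one reward per state
gamma = 0.9                     # discount factor, 0 < gamma < 1

# Sanity check: every row of T must be a probability distribution.
assert np.allclose(T.sum(axis=1), 1.0)
```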
Example: The Academic Life
[Figure: the Academic Life Markov chain from the previous section, repeated: states A, B, F, S, D with their rewards and transition probabilities.]
What are the states, transition probability matrix, rewards, discount factor for this problem?
Collecting the state values into a vector V, the rewards into R, and the transition probabilities into T, the expected discounted sums of rewards satisfy
V = R + γT V
(I − γT )V = R
V = (I − γT )^−1 R
where I is the N × N identity matrix.
The good thing about solving the above equation directly is that you get an exact answer.
The bad thing is that it is slow if you have a large number of states, i.e., when N is big (the matrix inversion costs roughly O(N^3)).
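A minimal NumPy sketch of this direct solution, using the same illustrative 3-state chain as in the previous snippet (np.linalg.solve is used instead of forming the inverse explicitly):

```python
import numpy as np

def solve_values_directly(T, R, gamma):
    """Solve (I - gamma * T) V = R exactly for the vector of state values V."""
    N = T.shape[0]
    return np.linalg.solve(np.eye(N) - gamma * T, R)

# Same illustrative 3-state chain as in the previous snippet.
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])
R = np.array([1.0, 0.0, -1.0])
print(solve_values_directly(T, R, gamma=0.9))
```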
There are many iterative methods for solving the equation, e.g.,
Dynamic programming (We will do Value Iteration)
Monte-Carlo evaluation
Temporal-Difference learning
Define
V^1(s_i) = Expected discounted sum of rewards over the next 1 time step from now
V^2(s_i) = Expected discounted sum of rewards over the next 2 time steps from now
V^3(s_i) = Expected discounted sum of rewards over the next 3 time steps from now
· · ·
V^k(s_i) = Expected discounted sum of rewards over the next k time steps from now
What are the formulas to compute them?
V^1(s_i) = r(s_i)
V^2(s_i) = r(s_i) + γ(T_i1 V^1(s_1) + T_i2 V^1(s_2) + . . . + T_iN V^1(s_N))
V^3(s_i) = r(s_i) + γ(T_i1 V^2(s_1) + T_i2 V^2(s_2) + . . . + T_iN V^2(s_N))
· · ·
V^k(s_i) = r(s_i) + γ(T_i1 V^{k−1}(s_1) + T_i2 V^{k−1}(s_2) + . . . + T_iN V^{k−1}(s_N))
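These recurrences translate directly into code. Below is a sketch (the helper name is my own choosing, not from the slides) that computes V^1, . . . , V^k for any Markov chain given as a row-stochastic matrix T and reward vector R:

```python
import numpy as np

def iterate_values(T, R, gamma, k):
    """Return [V^1, ..., V^k] computed via V^k = R + gamma * T @ V^(k-1)."""
    V = R.copy()               # V^1(s_i) = r(s_i)
    history = [V]
    for _ in range(k - 1):
        V = R + gamma * T @ V  # V^k(s_i) = r(s_i) + gamma * sum_j T_ij * V^(k-1)(s_j)
        history.append(V)
    return history
```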
Example: Weather

[Figure: Weather Markov chain with three states: Sun (S, reward 4), Wind (W, reward 0), and Hail (H, reward −8); every arrow in the diagram has probability 0.5.]

T (rows: current state, columns: next state):
        S     W     H
  S    0.5   0.5    0
  W    0.5    0    0.5
  H     0    0.5   0.5

k   V^k(S)   V^k(W)   V^k(H)      (to be filled in for k = 1, . . . , 5)
V^1(S) = r(S) = 4
V^1(W) = r(W) = 0
V^1(H) = r(H) = −8
V^2(S) = r(S) + γ(T_SS V^1(S) + T_SW V^1(W) + T_SH V^1(H))
V^2(W) = r(W) + γ(T_WS V^1(S) + T_WW V^1(W) + T_WH V^1(H))
V^2(H) = r(H) + γ(T_HS V^1(S) + T_HW V^1(W) + T_HH V^1(H))
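Plugging in the numbers (the discount factor is not visible in the extracted slide, but γ = 0.5 reproduces the table that follows):

V^2(S) = 4 + 0.5(0.5·4 + 0.5·0 + 0·(−8)) = 5
V^2(W) = 0 + 0.5(0.5·4 + 0·0 + 0.5·(−8)) = −1
V^2(H) = −8 + 0.5(0·4 + 0.5·0 + 0.5·(−8)) = −10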
Example: Weather

k   V^k(S)   V^k(W)   V^k(H)
1     4        0       −8
2     5       −1      −10
3
4
5
When to stop?
When the maximum absolute difference between two successive expected discounted sums of
rewards (V^k and V^{k−1}) is less than a threshold ξ, i.e., max_i |V^k(s_i) − V^{k−1}(s_i)| < ξ.
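A short sketch of this loop for the weather example. As noted above, γ = 0.5 is assumed here because it matches the table, and the threshold value ξ is illustrative:

```python
import numpy as np

# Weather example: states ordered as (Sun, Wind, Hail).
T = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
R = np.array([4.0, 0.0, -8.0])
gamma = 0.5   # assumed; not shown in the extracted slides
xi = 0.01     # stopping threshold (illustrative)

V_prev = R.copy()                        # V^1 = R
while True:
    V = R + gamma * T @ V_prev           # V^k = R + gamma * T V^(k-1)
    if np.max(np.abs(V - V_prev)) < xi:  # stop when max |V^k - V^(k-1)| < xi
        break
    V_prev = V
print(V)
```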
Definition
A Markov Decision Process is a tuple ⟨S, A, T , R, γ⟩
S: A finite set of states {s_1, s_2, . . . , s_N}
A: A finite set of actions {a_1, a_2, . . . , a_M}
T: A transition probability matrix T^a for each action a ∈ A
R: A reward {r_1, r_2, . . . , r_N} for each state
γ: A discount factor, where 0 < γ < 1
Value iteration works as before, except that we now take the maximum over actions:
V^k(s_i) = max_a ( r(s_i) + γ(T^a_i1 V^{k−1}(s_1) + T^a_i2 V^{k−1}(s_2) + . . . + T^a_iN V^{k−1}(s_N)) )
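A compact sketch of this update, assuming the transition probabilities are stored as one matrix per action in a 3-D NumPy array (the function and argument names are mine, not from the slides):

```python
import numpy as np

def mdp_value_iteration(T, R, gamma, num_iters):
    """Value iteration for an MDP.

    T: array of shape (M, N, N); T[a, i, j] = P(s_j | s_i, action a)
    R: array of shape (N,); reward of each state
    Returns the value estimates V^1, ..., V^num_iters.
    """
    V = R.copy()                 # V^1(s_i) = r(s_i)
    history = [V]
    for _ in range(num_iters - 1):
        # Q[a, i] = r(s_i) + gamma * sum_j T[a, i, j] * V^(k-1)(s_j)
        Q = R + gamma * (T @ V)
        V = Q.max(axis=0)        # V^k(s_i) = max_a Q[a, i]
        history.append(V)
    return history
```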
[Figure: an MDP with four states: Poor & Unknown (PU, reward 0), Poor & Famous (PF, reward 0), Rich & Unknown (RU, reward 10), Rich & Famous (RF, reward 10). Each arrow is labelled with an action, A or S, and its transition probability (0.5 or 1). Candidate policies (e.g. "Policy 2") are shown as tables mapping each state to an action.]
Once value iteration is done, the near-optimal policy consists of taking, in each state, the action
that gives the maximum expected discounted value of the next state, as sketched below.
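A small, self-contained sketch of that greedy policy extraction (names are illustrative):

```python
import numpy as np

def greedy_policy(T, R, gamma, V, action_names):
    """Pick, in each state, the action with the largest expected discounted value.

    T: array of shape (M, N, N), one transition matrix per action.
    V: array of shape (N,), the (converged) state values.
    """
    Q = R + gamma * (T @ V)          # Q[a, i] for every action a and state i
    return [action_names[a] for a in Q.argmax(axis=0)]
```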
[Figure: the four-state MDP again, with its transition arrows labelled by action (A or S) and probability.]

Rewards (state order PU, PF, RU, RF):
R = (0, 0, 10, 10)

T^A (transition matrix for action A):
        PU    PF    RU    RF
  PU   0.5   0.5    0     0
  PF    0     1     0     0
  RU   0.5   0.5    0     0
  RF    0     1     0     0

T^S (transition matrix for action S):
        PU    PF    RU    RF
  PU    1     0     0     0
  PF   0.5    0     0    0.5
  RU   0.5    0    0.5    0
  RF    0     0    0.5   0.5
γ = 0.9

The first iteration uses V^1(s_i) = r(s_i):
V^1(PU) = 0
V^1(PF) = 0
V^1(RU) = 10
V^1(RF) = 10
This gives the k = 1 row of the table below.
k   V(PU)      V(PF)      V(RU)      V(RF)          π(PU)   π(PF)   π(RU)   π(RF)
1   0          0          10         10
2   0          4.5        14.5       19             A/S     S       S       S
3   2.025      8.55       16.525     25.075         A       S       S       S
4   4.75875    12.195     18.3475    28.72          A       S       S       S
5   7.62919    15.0654    20.3978    31.1804        A       S       S       S
6   10.2126    17.4643    22.6121    33.2102        A       S       S       S
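For reference, a short script in the same spirit as the earlier sketches reproduces the table above (state order PU, PF, RU, RF; action order A, S; ties such as PU at k = 2 are broken in favour of the first action):

```python
import numpy as np

R = np.array([0.0, 0.0, 10.0, 10.0])        # rewards for PU, PF, RU, RF
T = np.array([
    [[0.5, 0.5, 0.0, 0.0],                  # T^A
     [1.0, 0.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0],                  # T^S
     [0.5, 0.0, 0.0, 0.5],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.5]],
])
gamma = 0.9
actions = ["A", "S"]

V = R.copy()                                # k = 1
print(1, V)
for k in range(2, 7):
    Q = R + gamma * (T @ V)                 # Q[a, i]
    V = Q.max(axis=0)
    policy = [actions[a] for a in Q.argmax(axis=0)]  # ties go to the first action
    print(k, np.round(V, 5), policy)
```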
Practice
Can you continue the value iteration and calculate the next few rows yourself? :)
Pros:
Will converge towards optimal values
Good for a small set of states
Cons:
Value iteration has to touch every state in every iteration, so it suffers when the total
number of states is large
It is slow because we have to consider every action in every state, and often there are
many actions