0% found this document useful (0 votes)

25 views51 pages

12 Reinforcement Learning Full

Okay, let's break this down step-by-step: VA = 0.6×160 + 0.2×VB + 0.2×VS VB = 0.2×VF + 0.2×VS + 0.6×VB VF = 0.3×VD + 0.7×VF VS = 0.3×VD + 0.7×VS VD = 0 We can set up a system of equations and solve for VA, VB, VF, VS, VD.

Uploaded by

ckcheun43

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views51 pages

12 Reinforcement Learning Full

Uploaded by

ckcheun43

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 51

COMP 2211 Exploring Artificial Intelligence

Introduction to Reinforcement Learning

Dr. Desmond Tsoi, Dr. Cecia Chan
Department of Computer Science & Engineering
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Reinforcement Learning
The goal of Reinforcement Learning (RL) is to design an autonomous/intelligent agent
that learns by interacting with an environment.
In standard RL setting, the agent perceives a state at every time step and chooses an
action.
The action is applied to the environment and the environment returns a reward and a new
state. The agent trains a policy to choose actions to maximize the sum of rewards.

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 2 / 51

Definition of Reinforcement Learning

RL, a type of machine learning, in which agents take actions in an environment aimed at
maximizing their cumulative awards - NVIDIA
RL is based on rewarding desired behaviors or punishing undesired ones. Instead of one
input producing one output, the algorithm produces a variety of outputs and is trained to
select the right one based on certain variables - Gartner

The above definitions are technically provided by experts in that field however for someone
who is starting with RL, these definitions might feel a little bit difficult.

Definition
Through a series of Trial and Error methods, an agent keeps learning continuously is an interactive
environment from its own actions and experiences. The only goal of it is to find a suitable action model
which would increase the total cumulative reward of the agent. It learns via interaction and feedback.

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 3 / 51

Explanation of Reinforcement Learning - Daily Life Example
Imagine training your dog to complete a task within an environment.
First, the trainer issues a command, which the dog observes (observation), the dog then
responds by taking an action.
If the action is close to the desired behavior, the trainer will likely provide a reward, such
as a food treat or a toy. Otherwise, no reward or a negative reward will be provided.
At the beginning of training, the dog will likely take more random actions like rolling over
when the command given is “sit”, as it is trying to associate specific observations with
actions and rewards.
This association, or mapping, between observations and actions is called policy.
From the dog’s perspective, the ideal case would be on in which it would respond
correctly to every command, so that it gets as many treats as possible.
So, the whole meaning of reinforcement learning training is to “tune” the dog’s policy so
that it learns the desired behaviors that will maximize some reward.
After training is complete, the dog should be able to observe the owner and take the
appropriate action, for example, sitting when commanded to “sit” by using the internal
policy it has developed.
{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 4 / 51
Explanation of Reinforcement Learning - Daily Live Example
Question
List the following in reference to the dog training example.

Agent, Environment, Observations, Actions, Rewards, Policy

Answer
Agent: Your dog
Environment: Your home, backyard, or any other place where you teach and play with your dog
Observations: What the dog observes
Actions: Sit, Roll, Stand, Walk, etc.
Rewards: Food treat or a toy
Policy: Generate the correct actions from the observations

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 5 / 51

Idea of Reinforcement Learning - Another Example
Question
Consider the task of parking a vehicle using automated driving system. List the following in reference to
this problem.

Agent, Environment, Observations, Actions, Rewards, Policy

Answer
Agent: Vehicle computer
Environment: Parking area
Observations: Readings from sensors such as cameras, GPS, and lidar (light detection and ranging)
Actions: Generate steering, braking, and acceleration commands
Rewards: Reach the parking point as soon as possible
Policy: Generate the correct actions from the observations

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 6 / 51

Part I

Basic Concepts

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 7 / 51

Example: Academic Life
An assistant professor gets paid, say, 160K per year.
How much, in total, will the assistant professor earn in their life?
160 + 160 + 160 + 160 + 160 + . . . = Infinity

What’s wrong with this argument?

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 8 / 51

Discounted Rewards & Discounted Sum of Future Rewards
A reward (payment) in the future is not worth quite as much as a reward now, because of
inflation. For example, being promised $10,000 next year is worth only 90% as much as
receiving $10,000 right now.
Question
Assume payment n years in future is worth only (0.9)n of payment now, what is the assistant
professor’s future discounted sum of rewards?

160 + 160 × (0.9)1 + 160 × (0.9)2 + 160 × (0.9)3 + . . .

=160 × (1 + 0.9 + (0.9)2 + (0.9)3 + . . .)

1
=160 ×
1 − 0.9
=1600

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 9 / 51

Discounted Rewards & Discounted Sum of Future Rewards

Discounting is a concept, where a parameter called the discount factor, γ, and 0 ≤ γ ≤ 1,

and a power of it multiplies a reward.
The “Discounted sum of future rewards” using discount factor γ is

(reward now )+γ(reward in 1 time step from now )

+γ 2 (reward in 2 time step from now )
+γ 3 (reward in 3 time step from now )
+...

People in economics and probabilistic decision-making do this all the time.

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 10 / 51

Example: The Academic Life
0.6 0.6 0.7
Hint: Expected value
A B F computation
Assist. 0.2 Assoc. 0.2 Full
Prof. Prof. Prof. Suppose we roll a fair 6-sided
160 480 3200 die. The expected value of
0.2 our die roll is
0.2 0.3
S D
1

1

On the 0.3 Dead ×1 + ×2 +
6 6
Street 0
80 1 1
×3 + ×4 +
6 6
Define:

0.7 1 1
×5 + × 6 = 3.5
VA = Expected discounted sum of future rewards starting in state A 6 6

VB = Expected discounted sum of future rewards starting in state B

Question
VF = Expected discounted sum of future rewards starting in state F
Assume discount factor γ =
VS = Expected discounted sum of future rewards starting in state S 0.9. How do we compute VA ,
VB , VT , VS , TD ?
VD = Expected discounted sum of future rewards starting in state D

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 11 / 51

Example: The Academic Life - Start from state A
But there are so many different possibilities!!! Each with different probability :(

Sample episodes, all start from A:

A → B → F → D:
160 + (0.2)(0.9)1 (480) + (0.2)(0.2)(0.9)2 (3200) + (0.2)(0.2)(0.3)(0.9)3 (0) = 350.08
A → A → S → D:
160 + (0.6)(0.9)1 (160) + (0.6)(0.2)(0.9)2 (80) + (0.6)(0.2)(0.3)(0.9)3 (0) = 254.176
A → B → S → D:
160 + (0.2)(0.9)1 (480) + (0.2)(0.2)(0.9)2 (80) + (0.2)(0.2)(0.3)(0.9)3 (0) = 248.992
···

It is very difficult to compute VA . > . <

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 12 / 51
Idea
0.6 0.6 0.7

A B F
Assist. 0.2 Assoc. 0.2 Full
Prof. Prof. Prof.
160 480 3200
0.2
0.2 0.3
S D
On the 0.3 Dead
Street 0
80

0.7

Let VA1 , VB1 , VF1 , VS1 , VD1 be the expected discounted sum of rewards over the next 1 time
step from now, do you know how to find them?
Let VA2 be the expected discounted sum of rewards over the next 2 time step from now,
do you know how to find it if you know VA1 , VB1 , VF1 , VS1 , and VD1 ?
{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 13 / 51
0.6 0.6 0.7

A B F
Assist. 0.2 Assoc. 0.2 Full
Prof. Prof. Prof.
160 480 3200
0.2
0.2 0.3
S D
On the 0.3 Dead
Street 0
80

0.7

VA1 = 160, VB1 = 480, VF1 = 3200, VS1 = 80, VD1 = 0

VA2 = 160 + 0.9 × (PAA VA1 + PAB VB1 + PAF VF1 + PAS VS1 + PAD VD1 )
= 160 + 0.9 × (0.6(160) + 0.2(480) + 0(3200) + 0.2(80) + 0(0)) = 347.2

Do you know how to compute VB2 , VF2 , VS2 , VD2 , . . .?

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 14 / 51
Markov Property

Definition
A state St is Markov if and only if

P(St+1 |St ) = P(St+1 |S1 , . . . , St )

The future is independent of the past given the present

The present state captures all relevant information from the history
Once the present state is known, the history may be thrown away

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 15 / 51

Part II

Problem Formulation

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 16 / 51

A Markov System with Rewards
A Markov system consists of the following:
A set of N states {s1 , s2 , . . . , sN }
A transition probability matrix (i.e., a 2D array showing the probability of going from one
state to another state):

 To 
T11 T12 · · · T1N
 T21 T22 · · · T2N 
T = From 
 
.. .. .. .. 
 . . . . 
TN1 TN2 · · · TNN
where Tij = P(next state st+1 = sj | this state st = si )
Note: Each row of the matrix sums to 1
Each state has a reward {r1 , r2 , . . . , rN }
There is a discount factor γ, where 0 < γ < 1
All future rewards are discounted by γ
{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 17 / 51
Example: The Academic Life

0.6 0.6 0.7

A B F
Assist. 0.2 Assoc. 0.2 Full
Prof. Prof. Prof.
160 480 3200
0.2
0.2 0.3
S D
On the 0.3 Dead
Street 0
80

0.7

What are the states, transition probability matrix, rewards, discount factor for this problem?

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 18 / 51

Example: Academic Life

0.6 0.6 0.7 States: {A, B, F , S, D}

Transition probability matrix:
A B F  
Assist. 0.2 Assoc. 0.2 Full 0.6 0.2 0 0.2 0
Prof. Prof. Prof.  0 0.6 0.2 0.2 0 
160 480 3200
 
0.2 T=  0 0 0.7 0 0.3 

0.2 0.3  0 0 0 0.7 0.3 
S D
On the 0.3 Dead 0 0 0 0 0
Street 0
80 Rewards: {160, 480, 3200, 80, 0}
Discount factor: 0.9
0.7

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 19 / 51

Value Function

Solving a Markov system

Write V (si ) = expected discounted sum of future rewards starting in state si

V (si ) = r (si ) + γ · Expected future rewards starting from next state

= r (si ) + γ · (Ti1 V (s1 ) + Ti2 V (s2 ) + . . . + T1N V (SN ))

Using vector notation, we have

      
V (s1 ) r1 T11 T12 · · · T1N V (s1 )
 V (s2 )   r2   T21 T22 · · · T2N  V (s2 ) 
 =  .. +γ
      
 .. .. .. .. ..  .. 
 .   .   . . . .  . 
V (sN ) rN TN1 TN2 · · · TNN V (sN )

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 20 / 51

Solving the System of Linear Equations

The equation is a linear equation, which can be solved directly.

V = R + γTV
(1 − γT )V = R
V = (1 − γT )−1 R

The good thing for directly solving the above equation is you get an exact number.
The bad thing is it is slow if you have a large number of states, i.e., N is big.
There are many iterative methods for solving the equation, e.g.,
Dynamic programming (We will do Value Iteration)
Monte-Carlo evaluation
Temporal-Difference learning

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 21 / 51

Value Iteration

Define
V 1 (si ) = Expected discounted sum of rewards over the next 1 time step from now
V 2 (si ) = Expected discounted sum of rewards over the next 2 time steps from now
V 3 (si ) = Expected discounted sum of rewards over the next 3 time steps from now
···
V k (si ) = Expected discounted sum of rewards over the next k time steps from now
What are the formula to compute all of them?
V 1 (si ) = r (si )
V 2 (si ) = r (si ) + γ(Ti1 V 1 (s1 ) + Ti2 V 1 (s2 ) + . . . + TiN V 1 (sN ))
V 3 (si ) = r (si ) + γ(Ti1 V 2 (s1 ) + Ti2 V 2 (s2 ) + . . . + TiN V 2 (sN ))
···
V k (si ) = r (si ) + γ(Ti1 V k−1 (s1 ) + Ti2 V k−1 (s2 ) + . . . + TiN V k−1 (sN ))

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 22 / 51

Example: Weather

 
0.5 0.5 0
T =  0.5 0 0.5 
W
Wind 0 0.5 0.5
0.5 0 0.5

k V k (S) V k (W ) V k (H)
S 0.5 0.5 H
Sun Hail 1
4 -8 2
0.5 0.5 3
4
5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 23 / 51

Example: Weather
W
Wind k V k (S) V k (W ) V k (H)
0.5 0 0.5 1 4 0 -8
2
3
S 0.5 0.5 H
Sun Hail 4
4 -8 5
0.5 0.5

V 1 (S) = r (S) = 4
V 1 (W ) = r (W ) = 0
V 1 (H) = r (H) = −8
V 2 (S) = r (S) + γ(TSS V 1 (S) + TWS V 1 (W ) + THS V 1 (H))
V 2 (W ) = r (W ) + γ(TSW V 1 (S)) + TWW V 1 (W ) + THW V 1 (H)
V 2 (H) = r (H) + γ(TSH V 1 (S) + THW V 1 (W ) + THH V 1 (H))
{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 24 / 51
Example: Weather
W
Wind k V k (S) V k (W ) V k (H)
0.5 0 0.5 1 4 0 -8
2 5 -1 -10
3
S 0.5 0.5 H
Sun Hail 4
4 -8 5
0.5 0.5

V 2 (S) = r (S) + γ(TSS V 1 (S) + TWS V 1 (W ) + THS V 1 (H))

= 4 + 0.5(0.5 × 4 + 0.5 × 0 + 0 × (−8)) = 5
V (W ) = r (W ) + γ(TSW V 1 (S)) + TWW V 1 (W ) + THW V 1 (H)
2

= 0 + 0.5(0.5 × 4 + 0 × 0 + 0.5 × (−8)) = −1

V (H) = r (H) + γ(TSH V 1 (S) + THW V 1 (W ) + THH V 1 (H))
2

= −8 + 0.5(0 × 4 + 0.5 × 0 + 0.5 × (−8)) = −10

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 25 / 51
Example: Weather
W
Wind k V k (S) V k (W ) V k (H)
0.5 0 0.5 1 4 0 -8
2 5 -1 -10
3 5 -1.25 -10.75
S 0.5 0.5 H
Sun Hail 4
4 -8 5
0.5 0.5

V 3 (S) = r (S) + γ(TSS V 2 (S) + TWS V 2 (W ) + THS V 2 (H))

= 4 + 0.5(0.5 × 5 + 0.5 × (−1) + 0 × (−10)) = 5
V (W ) = r (W ) + γ(TSW V 2 (S)) + TWW V 2 (W ) + THW V 2 (H)
3

= 0 + 0.5(0.5 × 5 + 0 × (−1) + 0.5 × (−10)) = −1.25

V (H) = r (H) + γ(TSH V 2 (S) + THW V 2 (W ) + THH V 2 (H))
3

= −8 + 0.5(0 × 5 + 0.5 × (−1) + 0.5 × (−10)) = −10.75

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 26 / 51
Example: Weather
W
Wind k V k (S) V k (W ) V k (H)
0.5 0 0.5 1 4 0 -8
2 5 -1 -10
3 5 -1.25 -10.75
S 0.5 0.5 H
Sun Hail 4 4.9375 -1.4375 -11
4 -8 5
0.5 0.5

V 4 (S) = r (S) + γ(TSS V 3 (S) + TWS V 3 (W ) + THS V 3 (H))

= 4 + 0.5(0.5 × 5 + 0.5 × (−1.25) + 0 × (−10.75)) = 4.9375
V (W ) = r (W ) + γ(TSW V 3 (S)) + TWW V 3 (W ) + THW V 3 (H)
4

= 0 + 0.5(0.5 × 5 + 0 × (−1.25) + 0.5 × (−10.75)) = −1.4375

V (H) = r (H) + γ(TSH V 3 (S) + THW V 3 (W ) + THH V 3 (H))
4

= −8 + 0.5(0 × 5 + 0.5 × (−1.25) + 0.5 × (−10.75)) = −11

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 27 / 51
Example: Weather
W
Wind k V k (S) V k (W ) V k (H)
0.5 0 0.5 1 4 0 -8
2 5 -1 -10
3 5 -1.25 -10.75
S 0.5 0.5 H
Sun Hail 4 4.9375 -1.4375 -11
4 -8 5 4.875 -1.515625 -11.109375
0.5 0.5

V 5 (S) = r (S) + γ(TSS V 4 (S) + TWS V 4 (W ) + THS V 4 (H))

= 4 + 0.5(0.5 × 4.9375 + 0.5 × (−1.4375) + 0 × (−11)) = 4.875
V (W ) = r (W ) + γ(TSW V 4 (S)) + TWW V 4 (W ) + THW V 4 (H)
5

= 0 + 0.5(0.5 × 4.9375 + 0 × (−1.4375) + 0.5 × (−11)) = −1.515625

V (H) = r (H) + γ(TSH V 4 (S) + THW V 4 (W ) + THH V 4 (H))
5

= −8 + 0.5(0 × 4.9375 + 0.5 × (−1.4375) + 0.5 × (−11)) = −11.109375

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 28 / 51
Value Iteration for Solving the System of Linear Equations

Compute V 1 (Si ) for each i in range [1, N]

Compute V 2 (Si ) for each i in range [1, N]
Compute V 3 (Si ) for each i in range [1, N]
..
.
Compute V k (Si ) for each i in range [1, N]

When to stop?
When the maximum absolute difference between two successive expected discounted sum of
rewards (V k and V k−1 ) is less than a threshold, ξ, i.e.,

Maxi |V k (si ) − V k−1 (si )| < ξ

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 29 / 51

Markov Decision Process
A Markov Decision Process (MDP) is a Markov reward process with decisions.
It is an environment in which all states are Markov.

Definition
A Markov Decision Process is a tuple ⟨S, A, T , R, γ⟩
S: A finite set of states {s1 , s2 , . . . , sN }
A: A finite set of actions {a1 , a2 , . . . , aM }
T: A transition probability matrix

Tija = P(Sj |Si , A = a)

R: Each state has a reward {r1 , r2 , . . . , rN }

γ: A discount factor 0 ≤ γ ≤ 1

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 30 / 51

Value Iteration (How to Determine Actions?)
Define
V 1 (si ) = Expected discounted sum of rewards over the next 1 time step from now
V 2 (si ) = Expected discounted sum of rewards over the next 2 time steps from now
V 3 (si ) = Expected discounted sum of rewards over the next 3 time steps from now
···
V k (si ) = Expected discounted sum of rewards over the next k time steps from now
What are the formula to compute all of them?
V 1 (si ) = r (si )
V 2 (si ) = maxa (r (si ) + γ(Ti1a V 1 (s1 ) + Ti2a V 1 (s2 ) + . . . + TiN
a
V 1 (sN )))
V (si ) = maxa (r (si ) + γ(Ti1 V (s1 ) + Ti2 V (s2 ) + . . . + TiN V 2 (sN )))
3 a 2 a 2 a

···
V k (si ) = maxa (r (si ) + γ(Ti1a V k−1 (s1 ) + Ti2a V k−1 (s2 ) + . . . + TiN a
V k−1 (sN )))

Bellman Optimality Equation

V k (si ) = maxa (r (si ) + γ(Ti1
a V k−1 (s ) + T a V k−1 (s ) + . . . + T a V k−1 (s )))
1 i2 2 iN N

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 31 / 51

Policies
A policy is a mapping from states to actions
Policy 1
A,
0.5 A,
1 State Action
A, 0.5 PU S
S, 1

PU PF PF A
Poor & Poor &
Unknown Famous RU S
0 0
S, 0.5 RF A

Policy 2
.5
A, 0.5
S, 0.5

S, 0.5
A, 1
,0
A

State Action
PU A
PF A
RU RF
Rich & Rich &
RU A
Unknown Famous
RF A
S, 0.5
S, 0.5

10 10
S, 0.5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 32 / 51

Finding the Near Optimal Policy

Compute V k (si ) for all i using value iteration

Compute (near) optimal policy π(si ) in state si as
a V k−1 (s ) + T a V k−1 (s ) + . . . + T a V k−1 (s )))
π(si ) = argmaxa (r (si ) + γ(Ti1 1 i2 2 iN N
until

Maxi |V k+1 (si ) − V k (si )| < ξ

Once it is done, the near optimal policy consists of taking the action that leads to the
state that has maximum state value.

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 33 / 51

Example
A,
0.5 A,
1 γ = 0.9
A, 0.5
S, 1

PU PF PU PF RU RF
Poor & Poor & R= 0 0 10 10
Unknown Famous
0 S, 0.5 0
PU PF RU RF
PU 0.5 0.5 0 0
.5 T A= PF 0 1 0 0
A, 0.5
S, 0.5

S, 0.5
A, 1
,0

RU 0.5 0.5 0 0
A

RF 0 1 0 0

RU RF PU PF RU RF
Rich & Rich & PU 1 0 0 0
Unknown Famous TS= PF 0.5 0 0 0.5

5
S, 0.

S, 0.
10 S, 0.5 10
RU 0.5 0 0.5 0
5

RF 0 0 0.5 0.5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 34 / 51

Example
γ = 0.9

PU PF RU RF k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

R= 0 0 10 10 1 1
2 2
3 3
PU PF RU RF 4 4
PU 0.5 0.5 0 0 5 5
T A= PF 0 1 0 0 6 6
RU 0.5 0.5 0 0
RF 0 1 0 0

PU PF RU RF
PU 1 0 0 0
S
T = PF 0.5 0 0 0.5
RU 0.5 0 0.5 0
RF 0 0 0.5 0.5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 35 / 51

Example

γ = 0.9
PU PF RU RF
R= 0 0 10 10
k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )
1 0 0 10 10 1
PU PF RU RF 2 2
PU 0.5 0.5 0 0 3 3
T A= PF 0 1 0 0 4 4
RU 0.5 0.5 0 0 5 5
RF 0 1 0 0 6 6

PU
PU
1
PF
0
RU
0
RF
0
V 1 (PU) = 0
S
T = PF
RU
0.5
0.5
0
0
0
0.5
0.5
0
V 1 (PF ) = 0
RF 0 0 0.5 0.5
V 1 (RU) = 10
V 1 (RF ) = 10

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 36 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 2 A/S
R= 0 0 10 10 3 3
4 4
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 2 (PU) = max(r (PU) + γ(TPU→PU
A
V 1 (PU)+
A
TPU→PF V 1 (PF ) + TPU→RU
A
V 1 (RU) + TPU→RF
A
V 1 (RF )),
S
PU
PU
1
PF
0
RU
0
RF
0
r (PU) + γ(TPU→PU V 1 (PU)+
TS= PF 0.5 0 0 0.5 S
TPU→PF V 1 (PF ) + TPU→RU
S
V 1 (RU) + TPU→RF
S
V 1 (RF )))
RU 0.5 0 0.5 0
RF 0 0 0.5 0.5 V 2 (PU) = max(0 + 0.9 × (0.5 × 0 + 0.5 × 0 + 0 × 10 + 0 × 10)),
0 + 0.9 × (1 × 0 + 0 × 0 + 0 × 10 + 0 × 10))
= max(0, 0) = 0

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 37 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 2 A/S S
R= 0 0 10 10 3 3
4 4
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 2 (PF ) = max(r (PF ) + γ(TPF
A 1
→PU V (PU)+
A 1 A 1 A 1
TPF →PF V (PF ) + TPF →RU V (RU) + TPF →RF V (RF )),
PU PF RU RF S 1
PU 1 0 0 0
r (PF ) + γ(TPF →PU V (PU)+
TS= PF 0.5 0 0 0.5 S
TPF 1 S 1 S 1
RU 0.5 0 0.5 0 →PF V (PF ) + TPF →RU V (RU) + TPF →RF V (RF )))
RF 0 0 0.5 0.5 V 2 (PF ) = max(0 + 0.9 × (0 × 0 + 1 × 0 + 0 × 10 + 0 × 10)),
0 + 0.9 × (0.5 × 0 + 0 × 0 + 0.5 × 10 + 0.5 × 10))
= max(0, 4.5) = 4.5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 38 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 2 A/S S S
R= 0 0 10 10 3 3
4 4
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 2 (RU) = max(r (RU) + γ(TRU→PU
A
V 1 (PU)+
A
TRU→PF V 1 (PF ) + TRU→RU
A
V 1 (RU) + TRU→RF
A
V 1 (RF )),
S
PU
PU
1
PF
0
RU
0
RF
0
r (RU) + γ(TRU→PU V 1 (PU)+
TS= PF 0.5 0 0 0.5 S
TRU→PF V 1 (PF ) + TRU→RU
S
V 1 (RU) + TRU→RF
S
V 1 (RF )))
RU 0.5 0 0.5 0
RF 0 0 0.5 0.5 V 2 (PF ) = max(10 + 0.9 × (0.5 × 0 + 0.5 × 0 + 0 × 10 + 0 × 10)),
10 + 0.9 × (0.5 × 0 + 0 × 0 + 0.5 × 10 + 0 × 10))
= max(0, 14.5) = 14.5

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 39 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2 A/S S S S
R= 0 0 10 10 3 3
4 4
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 2 (RF ) = max(r (RF ) + γ(TRF
A 1
→PU V (PU)+
A 1 A 1 A 1
TRF →PF V (PF ) + TRF →RU V (RU) + TRF →RF V (RF )),
PU PF RU RF S 1
PU 1 0 0 0
r (RF ) + γ(TRF →PU V (PU)+
TS= PF 0.5 0 0 0.5 S
TRF 1 S 1 S 1
RU 0.5 0 0.5 0 →PF V (PF ) + TRF →RU V (RU) + TRF →RF V (RF )))
RF 0 0 0.5 0.5 V 2 (PF ) = max(10 + 0.9 × (0 × 0 + 1 × 0 + 0 × 10 + 0 × 10)),
10 + 0.9 × (0 × 0 + 0 × 0 + 0.5 × 10 + 0.5 × 10))
= max(0, 19) = 19

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 40 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2
R= 0 0 10 10 3 2.025 3
4 4
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 3 (PU) = max(r (PU) + γ(TPU→PU
A
V 2 (PU)+
A
TPU→PF V 2 (PF ) + TPU→RU
A
V 2 (RU) + TPU→RF
A
V 2 (RF )),
S
PU
PU
1
PF
0
RU
0
RF
0
r (PU) + γ(TPU→PU V 2 (PU)+
TS= PF 0.5 0 0 0.5 S
TPU→PF V 2 (PF ) + TPU→RU
S
V 2 (RU) + TPU→RF
S
V 2 (RF )))
RU 0.5 0 0.5 0
RF 0 0 0.5 0.5 V 3 (PU) = max(0 + 0.9 × (0.5 × 0 + 0.5 × 4.5 + 0 × 14.5 + 0 × 19)),
0 + 0.9 × (1 × 0 + 0 × 4.5 + 0 × 14.5 + 0 × 19))
= max(2.205, 0) = 2.025

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 41 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2 A/S S S S
R= 0 0 10 10 3 2.025 8.55 3 A S
4 4
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 3 (PF ) = max(r (PF ) + γ(TPF
A 2
→PU V (PU)+
A 2 A 2 A 2
TPF →PF V (PF ) + TPF →RU V (RU) + TPF →RF V (RF )),
PU PF RU RF S 2
PU 1 0 0 0
r (PF ) + γ(TPF →PU V (PU)+
TS= PF 0.5 0 0 0.5 S
TPF 2 S 2 S 2
RU 0.5 0 0.5 0 →PF V (PF ) + TPF →RU V (RU) + TPF →RF V (RF )))
RF 0 0 0.5 0.5 V 3 (PF ) = max(0 + 0.9 × (0 × 0 + 1 × 4.5 + 0 × 14.5 + 0 × 19)),
0 + 0.9 × (0.5 × 0 + 0 × 4.5 + 0 × 14.5 + 0.5 × 19))
= max(4.05, 8.55) = 8.55

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 42 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2 A/S S S S
R= 0 0 10 10 3 2.025 8.55 16.525 3 A S S
4 4
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 3 (RU) = max(r (RU) + γ(TRU→PU
A
V 2 (PU)+
A
TRU→PF V 2 (PF ) + TRU→RU
A
V 2 (RU) + TRU→RF
A
V 2 (RF )),
S
PU
PU
1
PF
0
RU
0
RF
0
r (RU) + γ(TRU→PU V 2 (PU)+
TS= PF 0.5 0 0 0.5 S
TRU→PF V 2 (PF ) + TRU→RU
S
V 2 (RU) + TRU→RF
S
V 2 (RF )))
RU 0.5 0 0.5 0
RF 0 0 0.5 0.5 V 3 (RU) = max(10 + 0.9 × (0.5 × 0 + 0.5 × 4.5 + 0 × 14.5 + 0 × 19)),
10 + 0.9 × (0.5 × 0 + 0 × 4.5 + 0.5 × 14.5 + 0 × 19))
= max(12.025, 16.525) = 16.525

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 43 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2 A/S S S S
R= 0 0 10 10 3 2.025 8.55 16.525 25.075 3 A S S S
4 4
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 3 (RF ) = max(r (RF ) + γ(TRF
A 2
→PU V (PU)+
A 2 A 2 A 2
TRF →PF V (PF ) + TRF →RU V (RU) + TRF →RF V (RF )),
PU PF RU RF S 2
PU 1 0 0 0
r (RF ) + γ(TRF →PU V (PU)+
TS= PF 0.5 0 0 0.5 S
TRF 2 S 2 S 2
RU 0.5 0 0.5 0 →PF V (PF ) + TRF →RU V (RU) + TRF →RF V (RF )))
RF 0 0 0.5 0.5 V 3 (PF ) = max(10 + 0.9 × (0 × 0 + 1 × 4.5 + 0 × 14.5 + 0 × 19)),
10 + 0.9 × (0 × 0 + 0 × 4.5 + 0.5 × 14.5 + 0.5 × 19))
= max(14.05, 25.075) = 25.075

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 44 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2 A/S S S S
R= 0 0 10 10 3 2.025 8.55 16.525 25.075 3 A S S S
4 4.75875 4 A
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 4 (PU) = max(r (PU) + γ(TPU→PU
A
V 3 (PU)+
A
TPU→PF V 3 (PF ) + TPU→RU
A
V 3 (RU) + TPU→RF
A
V 3 (RF )),
S
PU
PU
1
PF
0
RU
0
RF
0
r (PU) + γ(TPU→PU V 3 (PU)+
TS= PF 0.5 0 0 0.5 S
TPU→PF V 3 (PF ) + TPU→RU
S
V 3 (RU) + TPU→RF
S
V 3 (RF )))
RU 0.5 0 0.5 0
RF 0 0 0.5 0.5 V 4 (PU) = max(0 + 0.9 × (0.5 × 2.025 + 0.5 × 8.55 + 0 × 16.525 + 0 × 25.075)),
0 + 0.9 × (1 × 2.025 + 0 × 8.55 + 0 × 16.525 + 0 × 25.075))
= max(4.75875, 1.8225) = 4.75875

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 45 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2 A/S S S S
R= 0 0 10 10 3 2.025 8.55 16.525 25.075 3 A S S S
4 4.75875 12.195 4 A S
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 4 (PF ) = max(r (PF ) + γ(TPF
A 3
→PU V (PU)+
A 3 A 3 A 3
TPF →PF V (PF ) + TPF →RU V (RU) + TPF →RF V (RF )),
PU PF RU RF S 3
PU 1 0 0 0
r (PF ) + γ(TPF →PU V (PU)+
TS= PF 0.5 0 0 0.5 S
TPF 3 S 3 S 3
RU 0.5 0 0.5 0 →PF V (PF ) + TPF →RU V (RU) + TPF →RF V (RF )))
RF 0 0 0.5 0.5 V 4 (PF ) = max(0 + 0.9 × (0 × 2.025 + 1 × 8.55 + 0 × 16.525 + 0 × 25.075)),
0 + 0.9 × (0.5 × 2.025 + 0 × 8.55 + 0 × 16.525 + 0.5 × 25.075))
= max(7.695, 12.195) = 12.195

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 46 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2 A/S S S S
R= 0 0 10 10 3 2.025 8.55 16.525 25.075 3 A S S S
4 4.75875 12.195 18.3475 4 A S S
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 4 (RU) = max(r (RU) + γ(TRU→PU
A
V 3 (PU)+
A
TRU→PF V 3 (PF ) + TRU→RU
A
V 3 (RU) + TRU→RF
A
V 3 (RF )),
S
PU
PU
1
PF
0
RU
0
RF
0
r (RU) + γ(TRU→PU V 3 (PU)+
TS= PF 0.5 0 0 0.5 S
TRU→PF V 3 (PF ) + TRU→RU
S
V 3 (RU) + TRU→RF
S
V 3 (RF )))
RU 0.5 0 0.5 0
RF 0 0 0.5 0.5 V 4 (PF ) = max(10 + 0.9 × (0.5 × 2.025 + 0.5 × 8.55 + 0 × 16.525 + 0 × 25.075)),
10 + 0.9 × (0.5 × 2.025 + 0 × 8.55 + 0.5 × 16.525 + 0 × 25.075))
= max(14.75875, 18.3475) = 18.3475

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 47 / 51

Example
γ = 0.9

k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )

1 0 0 10 10 1
PU PF RU RF 2 0 4.5 14.5 19 2 A/S S S S
R= 0 0 10 10 3 2.025 8.55 16.525 25.075 3 A S S S
4 4.75875 12.195 18.3475 28.72 4 A S S S
5 5
PU PF RU RF 6 6
PU 0.5 0.5 0 0
A
T = PF 0 1 0 0
RU 0.5 0.5 0 0
RF 0 1 0 0 V 4 (RF ) = max(r (RF ) + γ(TRF
A 3
→PU V (PU)+
A 3 A 3 A 3
TRF →PF V (PF ) + TRF →RU V (RU) + TRF →RF V (RF )),
PU PF RU RF S 3
PU 1 0 0 0
r (RF ) + γ(TRF →PU V (PU)+
TS= PF 0.5 0 0 0.5 S
TRF 3 S 3 S 3
RU 0.5 0 0.5 0 →PF V (PF ) + TRF →RU V (RU) + TRF →RF V (RF )))
RF 0 0 0.5 0.5 V 4 (RF ) = max(10 + 0.9 × (0 × 2.025 + 1 × 8.55 + 0 × 16.525 + 0 × 25.075)),
10 + 0.9 × (0 × 2.025 + 0 × 8.55 + 0.5 × 16.525 + 0.5 × 25.075))
= max(17.695, 28.72) = 28.72

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 48 / 51

Example

PU PF RU RF
R= 0 0 10 10
γ = 0.9

PU PF RU RF
PU 0.5 0.5 0 0 k V(PU) V(PF) V(RU) V(RF) k π(PU) π(PF ) π(RU) π(RF )
T A= PF 0 1 0 0 1 0 0 10 10 1
RU 0.5 0.5 0 0 2 0 4.5 14.5 19 2 A/S S S S
RF 0 1 0 0 3 2.025 8.55 16.525 25.075 3 A S S S
4 4.75875 12.195 18.3475 28.72 4 A S S S
5 7.62919 15.0654 20.3978 31.1804 5 A S S S
6 10.2126 17.4643 22.6121 33.2102 6 A S S S
PU PF RU RF
PU 1 0 0 0
S
T = PF 0.5 0 0 0.5
RU 0.5 0 0.5 0
RF 0 0 0.5 0.5

Practice
Can you calculate the remaining ones? :)

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 49 / 51

Pros and Cons of Value Iteration

Pros:
Will converge towards optimal values
Good for a small set of states
Cons:
Value iteration has to touch every state in every iteration and so if we have a large number
of total states, value iteration suffers
It is slow because we have to consider all actions at every state, and often, there are many
actions

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 50 / 51

That’s all!
Any questions?

{desmond,kccecia}@ust.hk COMP 2211 (Fall 2022) 51 / 51

Combined Set
No ratings yet
Combined Set
106 pages
Markov Decision Process (MDP)
No ratings yet
Markov Decision Process (MDP)
31 pages
Week 12
No ratings yet
Week 12
59 pages
13 RL 1
No ratings yet
13 RL 1
68 pages
Ai (It) Unit-5
No ratings yet
Ai (It) Unit-5
43 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
59 pages
ML Unit-V
No ratings yet
ML Unit-V
20 pages
Introduction To Reinforcement Learning
100% (1)
Introduction To Reinforcement Learning
52 pages
Intro To Reinforcement Learning
No ratings yet
Intro To Reinforcement Learning
56 pages
7.reinforcement Learning-Introduction-The Learning Task Q-Learning
No ratings yet
7.reinforcement Learning-Introduction-The Learning Task Q-Learning
34 pages
Kguh
No ratings yet
Kguh
38 pages
Reinforcement Learning: Karan Kathpalia
No ratings yet
Reinforcement Learning: Karan Kathpalia
80 pages
G.5 - Learning Burden PPT REV
No ratings yet
G.5 - Learning Burden PPT REV
38 pages
A (Long) Peek Into Reinforcement Learning - Lil'Log
No ratings yet
A (Long) Peek Into Reinforcement Learning - Lil'Log
23 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
38 pages
Unit 4
No ratings yet
Unit 4
56 pages
Deep Reinforcement Learning
No ratings yet
Deep Reinforcement Learning
25 pages
Reinforcement Learning Mastery Path
No ratings yet
Reinforcement Learning Mastery Path
18 pages
Sections
No ratings yet
Sections
76 pages
UNIT 5 - Part 3
No ratings yet
UNIT 5 - Part 3
21 pages
Module 1
No ratings yet
Module 1
72 pages
ML - Unit 3 - Part II
No ratings yet
ML - Unit 3 - Part II
51 pages
L13 Reinforcement Learning
No ratings yet
L13 Reinforcement Learning
35 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
35 pages
Unit 3
No ratings yet
Unit 3
29 pages
Reinforcement Learning, Q-Learning
No ratings yet
Reinforcement Learning, Q-Learning
20 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
17 pages
What Is Coherence in Writing
100% (1)
What Is Coherence in Writing
3 pages
Body Language - Guide To Reading Body Language Signals in Management, Training, Courtship, Flirting and Other Communications and Relationships
No ratings yet
Body Language - Guide To Reading Body Language Signals in Management, Training, Courtship, Flirting and Other Communications and Relationships
62 pages
Reinforcement Learning in The Era of LLMS: What Is Essential? What Is Needed? An RL Perspective On RLHF, Prompting, and Beyond
No ratings yet
Reinforcement Learning in The Era of LLMS: What Is Essential? What Is Needed? An RL Perspective On RLHF, Prompting, and Beyond
11 pages
AI Week 15
No ratings yet
AI Week 15
3 pages
ISOM2700 Practice Set4 Sol
100% (1)
ISOM2700 Practice Set4 Sol
8 pages
Semantic Satiation by Leon James
No ratings yet
Semantic Satiation by Leon James
170 pages
Reinforcement Learning (DQN) Tutorial - PyTorch Tutorials 2.6.0+cu124 Documentation
No ratings yet
Reinforcement Learning (DQN) Tutorial - PyTorch Tutorials 2.6.0+cu124 Documentation
1 page
Shapes Lesson Plan
100% (1)
Shapes Lesson Plan
4 pages
Unit-5 Mla
No ratings yet
Unit-5 Mla
22 pages
Unit-5 ML Notes
No ratings yet
Unit-5 ML Notes
31 pages
Lesson 5 AI
No ratings yet
Lesson 5 AI
38 pages
Reflection Paper On Guidance and Counseling
93% (15)
Reflection Paper On Guidance and Counseling
2 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
9 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
32 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
29 pages
DTVP 2
0% (1)
DTVP 2
2 pages
Reinforced Learning
No ratings yet
Reinforced Learning
25 pages
Unit 5
No ratings yet
Unit 5
45 pages
Reinforcement Learning and Robotics
No ratings yet
Reinforcement Learning and Robotics
35 pages
Ds d79 Diy Solution v1 1tv pb2fp74
No ratings yet
Ds d79 Diy Solution v1 1tv pb2fp74
5 pages
37 RL
No ratings yet
37 RL
18 pages
2 Python Fundamentals Full
No ratings yet
2 Python Fundamentals Full
103 pages
Reinforcement Learning MY101
No ratings yet
Reinforcement Learning MY101
15 pages
3-Internet Services and Mobile Apps
No ratings yet
3-Internet Services and Mobile Apps
28 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
32 pages
Unit 5 - Reinforcement Learning
No ratings yet
Unit 5 - Reinforcement Learning
15 pages
FS1 - Episode 10 - Delos Santos
No ratings yet
FS1 - Episode 10 - Delos Santos
16 pages
Reinforcement Learning Model Based Planning Dynamic Programming
No ratings yet
Reinforcement Learning Model Based Planning Dynamic Programming
17 pages
Lect28 4up
No ratings yet
Lect28 4up
11 pages
Evolutionary Game Theory and Multi-Agent Reinforcement Learning
No ratings yet
Evolutionary Game Theory and Multi-Agent Reinforcement Learning
26 pages
Unit 3 Ai
No ratings yet
Unit 3 Ai
5 pages
Reinforcement Learning-1
No ratings yet
Reinforcement Learning-1
13 pages
15-Mobile and Local Marketing
No ratings yet
15-Mobile and Local Marketing
29 pages
SP14 CS188 Lecture 10 - Reinforcement Learning I PDF
No ratings yet
SP14 CS188 Lecture 10 - Reinforcement Learning I PDF
38 pages
Lecture 1: Introduction To Reinforcement Learning: David Silver
No ratings yet
Lecture 1: Introduction To Reinforcement Learning: David Silver
46 pages
Unit 5
No ratings yet
Unit 5
10 pages
CSTP 3 Roth 4
No ratings yet
CSTP 3 Roth 4
7 pages
17-Privacy and Information Rights
No ratings yet
17-Privacy and Information Rights
30 pages
The Mental Calculator: Learn to do Extraordinary Calculations at Lightning Speed
From Everand
The Mental Calculator: Learn to do Extraordinary Calculations at Lightning Speed
Seim Daniel
No ratings yet
A Beginner's Guide To Deep Reinforcement Learning: Skymind - Ai
No ratings yet
A Beginner's Guide To Deep Reinforcement Learning: Skymind - Ai
23 pages
12 Netflix
No ratings yet
12 Netflix
28 pages
EFL Adult Learners' Perception of Learning English Vocabulary Through Pictures at A Private English Center
No ratings yet
EFL Adult Learners' Perception of Learning English Vocabulary Through Pictures at A Private English Center
8 pages
Lisa-Madeline Smith - [PhD Dissertation] Natural Science and Philosophical Hermeneutics_ An Exploration of Understanding in the Thought of Werner Heisenberg and Hans Georg Gadamer-University of Manito.pdf
No ratings yet
Lisa-Madeline Smith - [PhD Dissertation] Natural Science and Philosophical Hermeneutics_ An Exploration of Understanding in the Thought of Werner Heisenberg and Hans Georg Gadamer-University of Manito.pdf
209 pages
ENG103 Course Outlinefor Fall 2022
No ratings yet
ENG103 Course Outlinefor Fall 2022
6 pages
10 Minimax Full
No ratings yet
10 Minimax Full
55 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
11 pages
Reinf 2
No ratings yet
Reinf 2
4 pages
1 Intro
No ratings yet
1 Intro
27 pages
DW 01
No ratings yet
DW 01
14 pages
5-Porter's Five Forces
No ratings yet
5-Porter's Five Forces
18 pages
Reinforcement LN-6
No ratings yet
Reinforcement LN-6
13 pages
Work Motivation: Principles and Applications Damodar Suar
No ratings yet
Work Motivation: Principles and Applications Damodar Suar
28 pages
Reinforcement Learning: Instructor: Max Welling
No ratings yet
Reinforcement Learning: Instructor: Max Welling
18 pages
ISOM2700 Practice Set5 Sol
No ratings yet
ISOM2700 Practice Set5 Sol
16 pages
Reinforcement Learning: Russell and Norvig: CH 21
No ratings yet
Reinforcement Learning: Russell and Norvig: CH 21
16 pages
Automatic Speech Recognition
No ratings yet
Automatic Speech Recognition
30 pages
Schlechty Engagement
No ratings yet
Schlechty Engagement
4 pages
XRMB Handbook
No ratings yet
XRMB Handbook
139 pages
Cheatsheet 1
No ratings yet
Cheatsheet 1
1 page
GRAMMAR REVIEW. Glenda Pira
No ratings yet
GRAMMAR REVIEW. Glenda Pira
6 pages
A Beginners Guide To Deep Reinforcement Learning PDF
No ratings yet
A Beginners Guide To Deep Reinforcement Learning PDF
9 pages
After Class Quiz #5 - Sol
No ratings yet
After Class Quiz #5 - Sol
8 pages
Preparing For The Job Interview PDF
No ratings yet
Preparing For The Job Interview PDF
4 pages
0 Logistics Full
No ratings yet
0 Logistics Full
31 pages
After Class Quiz #1 Sol (Updated)
No ratings yet
After Class Quiz #1 Sol (Updated)
4 pages
Work Sheet - Topic (Plant Sensitivity) (1) 1090
No ratings yet
Work Sheet - Topic (Plant Sensitivity) (1) 1090
5 pages
Types of Assessment
No ratings yet
Types of Assessment
5 pages
Daily Lesson Plan Ts25
No ratings yet
Daily Lesson Plan Ts25
2 pages
Generative AI Blog
No ratings yet
Generative AI Blog
2 pages
Sound Cylinders MC
No ratings yet
Sound Cylinders MC
3 pages
Unit 104 Engineering Perspectives and Skills
No ratings yet
Unit 104 Engineering Perspectives and Skills
10 pages
Phenomenology
No ratings yet
Phenomenology
5 pages
How To Teach Speaking
No ratings yet
How To Teach Speaking
25 pages
Lesson 2 - Models of Organizational Behavior: Collegial and System
No ratings yet
Lesson 2 - Models of Organizational Behavior: Collegial and System
8 pages
German Dataset Tasks
No ratings yet
German Dataset Tasks
6 pages
Literature Review GTM
No ratings yet
Literature Review GTM
2 pages