Lecture 19 35
CMPUT 365
Introduction to RL
Marlos C. Machado Class 19/35
Reminder I
You should be enrolled in the private session we created in Coursera for CMPUT 365.
I cannot use marks from the public session for your course marks. You need to check,
every time, that you are in the private session and that you are submitting quizzes and
assignments to the private session.
At the end of the term, I will not port grades from the public session in Coursera.
If you have any questions or concerns, talk with the TAs or email us at
[email protected].
Reminder II
● Exam viewing on Tuesday and Wednesday
○ Tuesday: 2pm - 5pm in ATH 3-28
○ Wednesday: 3pm - 5pm in ATH 3-32
● TD update: V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
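A minimal sketch of one tabular TD(0) update step, assuming the standard form in which the estimate for the current state is moved toward the one-step target (reward plus discounted next-state estimate). The function name `td0_update`, the dictionary representation of V, and the default step size and discount are illustrative choices, not taken from the slides.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one tabular TD(0) update in place and return the TD error.

    V is a dict mapping states to value estimates (hypothetical representation);
    states not yet in V are treated as having value 0.
    """
    # The value of a terminal state is defined to be 0.
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + gamma * v_next - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


# Usage: one observed transition A -> B with reward 0.
V = {"A": 0.0, "B": 0.5}
delta = td0_update(V, "A", r=0.0, s_next="B", alpha=0.1, gamma=1.0)
print(V["A"], delta)
```

In this usage, the estimate for A moves a fraction α of the way toward the current estimate for B, since the reward is 0 and γ = 1.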
Optimality of TD(0)
● Under batch training, constant-α MC converges to values, V(s), that are sample
averages of the actual returns experienced after visiting each state s. These are
optimal estimates in the sense that they minimize the mean square error from
the actual returns in the training set.
● Batch TD(0) gives the answer obtained by first modeling the Markov process
from the data and then computing the estimates that are exactly correct given
that model (the certainty-equivalence estimate).
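The two notions of optimality can be written out as follows. This is a sketch in notation that is not on the slides: G_i(s) is the i-th return observed after a visit to s in the batch, N(s) is the number of such visits, N(s, s') is the number of observed transitions from s to s', and \hat{p}, \hat{r} form the maximum-likelihood model built from those counts. The value of the terminal state is taken to be 0.

```latex
% Batch MC: the sample average is the minimizer of squared error
% against the returns in the training set.
V_{\mathrm{MC}}(s)
  = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)
  = \arg\min_{v} \sum_{i=1}^{N(s)} \bigl( G_i(s) - v \bigr)^2

% Batch TD(0): build the maximum-likelihood model from counts ...
\hat{p}(s' \mid s) = \frac{N(s, s')}{N(s)},
\qquad
\hat{r}(s, s') = \text{average reward observed on transitions } s \to s'

% ... then solve the Bellman equation exactly for that model
% (the certainty-equivalence estimate).
V_{\mathrm{TD}}(s)
  = \sum_{s'} \hat{p}(s' \mid s)
    \bigl[ \hat{r}(s, s') + \gamma \, V_{\mathrm{TD}}(s') \bigr]
```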
Example
V(A) = ?
V(B) = ?
Example
V(A) = ? ¾ (TD) or 0 (MC)?
V(B) = ¾
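Assuming this slide is the "You are the Predictor" example from Sutton and Barto (Example 6.4), the batch is eight undiscounted episodes: one of A, 0, B, 0; six of B, 1; and one of B, 0. The sketch below computes both answers under that assumption. The episode data is not recoverable from the slide text itself, and the names `EPISODES`, `batch_mc`, and `batch_td0`, as well as the step size and number of sweeps, are illustrative.

```python
from collections import defaultdict

# Assumed batch (Sutton & Barto, Example 6.4). Each episode is a list of
# (state, reward, next_state) transitions; None marks termination.
EPISODES = (
    [[("A", 0.0, "B"), ("B", 0.0, None)]]
    + [[("B", 1.0, None)] for _ in range(6)]
    + [[("B", 0.0, None)]]
)


def batch_mc(episodes, gamma=1.0):
    """Average the returns observed after each visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the return from each step onward.
        for state, reward, _ in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


def batch_td0(episodes, alpha=0.01, gamma=1.0, sweeps=2000):
    """Batch TD(0): sum the increments over the whole batch, then apply them."""
    V = defaultdict(float)
    for _ in range(sweeps):
        increments = defaultdict(float)
        for episode in episodes:
            for state, reward, next_state in episode:
                v_next = 0.0 if next_state is None else V[next_state]
                increments[state] += alpha * (reward + gamma * v_next - V[state])
        # Values change only after the full pass over the batch.
        for state, inc in increments.items():
            V[state] += inc
    return dict(V)


mc = batch_mc(EPISODES)
td = batch_td0(EPISODES)
print("MC:", mc)
print("TD:", {s: round(v, 3) for s, v in td.items()})
```

Under the assumed data, both methods agree that V(B) is ¾ (six of the eight visits to B were followed by a return of 1). They differ on A: the only return ever observed from A is 0, so batch MC reports 0, while batch TD(0) uses the observed A to B transition and approaches ¾.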
TD vs Monte Carlo
“Batch Monte Carlo methods always find the estimates that minimize mean square
error on the training set, whereas batch TD(0) always finds the estimates that would
be exactly correct for the maximum-likelihood model of the Markov process.”